Which models stay strong across heterogeneous datasets and probabilistic settings?
FEV Bench
FEV Bench evaluates time series foundation models across 100 forecasting tasks drawn from 96 real-world datasets spanning 7 domains, including 46 tasks with covariates. Pretrained models are evaluated under a zero-shot / leakage policy; task-specific models may train on each task's training split. This leaderboard view shows the full MASE ranking for point-forecasting accuracy.
What this benchmark answers
How well does a model generalize to unseen real-world forecasting tasks?
Methodology
Pretrained models run zero-shot under the benchmark's leakage policy. Task-specific models (classical baselines, fine-tuned models) may use each task's training split. All models are ranked by MASE-based Skill Score relative to a seasonal naive baseline; Win Rate shows how often a model leads individual task rankings.
Chronos-2 leads with MASE Skill Score 35.50, followed by TimesFM-2.5 (30.20) and TiRex (30.01).
9 of 21 ranked models hosted on TSFM.ai · Higher is better
Rankings
[Chart: MASE Skill Score vs Win Rate]
Full results
| # | Model | MASE Skill Score |
|---|---|---|
| 1 | Chronos-2 | 35.50 |
| 2 | TimesFM-2.5 | 30.20 |
| 3 | TiRex | 30.01 |
| 4 | TabPFN-TS | 29.83 |
| 5 | | 28.23 |
| 6 | | 27.83 |
| 7 | | 27.22 |
| 8 | | 26.52 |
| 9 | | 24.75 |
| 10 | CatBoost | 23.69 |
| 11 | LightGBM | 21.69 |
| 12 | TFT | 20.48 |
| 13 | | 18.37 |
| 14 | DeepAR | 17.52 |
| 15 | Stat. Ensemble | 16.44 |
| 16 | AutoARIMA | 11.63 |
| 17 | AutoTheta | 10.99 |
| 18 | AutoETS | 2.26 |
| 19 | Seasonal Naive | 0.00 |
| 20 | Naive | -16.67 |
| 21 | Drift | -18.14 |
What FEV Bench measures
FEV Bench runs 100 forecasting tasks from 96 datasets across 7 domains, including 46 tasks with covariates. The benchmark separates pretrained foundation models, which are evaluated zero-shot under a leakage policy, from task-specific models that may train on each task's training split. Both tracks appear in the leaderboard; the training runtime column distinguishes them. Several pretrained models have declared partial overlap between their training corpora and the benchmark datasets; the benchmark imputes results for the affected tasks rather than excluding those models entirely.
MASE Skill Score vs Win Rate
This leaderboard ranks by MASE Skill Score: how far a model's error falls below that of the seasonal naive baseline, aggregated across all 100 tasks. Win Rate captures how often a model finishes first on individual tasks. FEV Bench also publishes SQL (scaled quantile loss) for probabilistic accuracy, with WQL (weighted quantile loss) and WAPE (weighted absolute percentage error) as secondary metrics. A model with a high MASE Skill Score but a modest Win Rate is strong on average but inconsistent: a few large wins carry the aggregate. The reverse pattern means it wins often but by narrow margins.
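The contrast between the two metrics can be illustrated with a small sketch. It assumes the Skill Score is 100 times one minus the geometric mean of per-task error ratios against the seasonal naive baseline, and it implements Win Rate as this page describes it (the share of tasks on which a model finishes first). Both aggregation choices are assumptions made for illustration, the function names are hypothetical, and the per-task MASE values are made up; none of this is taken from the benchmark's code.

```python
import numpy as np

def skill_score(errors: np.ndarray, baseline_errors: np.ndarray) -> float:
    """Percent skill score: 100 * (1 - geometric mean of per-task error ratios).

    Assumption: geometric-mean aggregation; this page only says the score
    is aggregated across tasks.
    """
    ratios = errors / baseline_errors
    return float(100.0 * (1.0 - np.exp(np.mean(np.log(ratios)))))

def first_place_rate(task_errors: dict, model: str) -> float:
    """Fraction of tasks on which `model` has the strictly lowest error."""
    others = np.stack([e for name, e in task_errors.items() if name != model])
    return float(np.mean(task_errors[model] < others.min(axis=0)))

# Made-up per-task MASE values for hypothetical models on four tasks.
mase = {
    "seasonal_naive": np.array([1.00, 1.00, 1.00, 1.00]),
    "model_a": np.array([0.90, 0.90, 0.90, 1.00]),  # wins often, narrowly
    "model_b": np.array([0.95, 0.95, 0.95, 0.25]),  # wins once, by a lot
}

for name in ("model_a", "model_b"):
    s = skill_score(mase[name], mase["seasonal_naive"])
    w = first_place_rate(mase, name)
    print(f"{name}: skill score {s:.1f}, first-place rate {w:.2f}")
```

In this toy setup `model_a` finishes first on three of four tasks but has the lower skill score, while `model_b` wins only once yet scores higher because its single win is large.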
How to interpret it
- This view shows the full MASE leaderboard. FEV Bench also reports SQL (probabilistic) and WQL/WAPE metrics.
- High MASE Skill Score means the model beats a seasonal naive baseline by a wider margin across all 100 tasks.
- Several pretrained models have declared training-corpus overlap with benchmark datasets; affected tasks are imputed per the benchmark's leakage policy.
- Cross-check with domain-specific benchmarks before making a production choice.
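One way to read the scores in the table is to convert them back into an error ratio against the baseline. The sketch below assumes the published Skill Score is a percentage of the form 100 × (1 − relative error), which is consistent with Seasonal Naive scoring 0.00 and weaker baselines scoring below zero, but is not confirmed on this page. The helper name `error_ratio` is hypothetical.

```python
def error_ratio(skill_score_pct: float) -> float:
    """Aggregate error relative to the seasonal naive baseline (1.0 = parity).

    Assumption: skill score = 100 * (1 - relative error).
    """
    return 1.0 - skill_score_pct / 100.0

# Scores copied from the leaderboard table on this page.
scores = {"Chronos-2": 35.50, "Seasonal Naive": 0.00, "Naive": -16.67}

for name, score in scores.items():
    print(f"{name}: {error_ratio(score):.4f}x the baseline error")
```

Under that assumption, a score of 35.50 corresponds to roughly 0.645 times the baseline's aggregate error, and a negative score corresponds to a ratio above 1, meaning the model does worse than seasonal naive.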
Frequently asked questions
- What is FEV Bench?
- FEV Bench is a forecasting benchmark of 100 tasks from 96 real-world time-series datasets across 7 domains, including covariate-rich and multivariate settings. This leaderboard view shows MASE-based Skill Score for point-forecasting accuracy. Pretrained models are evaluated under the benchmark's zero-shot/leakage policy; task-specific models may train on each task's training split.
- How often is the FEV Bench leaderboard updated?
- TSFM.ai refreshes the leaderboard automatically every 12 hours from the official upstream source hosted on Hugging Face.
- What does MASE Skill Score measure?
- Skill Score quantifies how much better a model performs compared to a seasonal naive baseline, using MASE (Mean Absolute Scaled Error) as the base metric. Higher values indicate a larger improvement over the baseline across all 100 benchmark tasks.
- Should I use FEV Bench to pick a model for production?
- FEV Bench is a strong starting point for general-purpose point-forecasting accuracy, but you should cross-check with domain-specific benchmarks like BOOM (for observability data) or GIFT-Eval (for probabilistic robustness) before committing to a production choice.
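For reference, the MASE mentioned in the answers above is conventionally defined as the mean absolute forecast error over the horizon, scaled by the in-sample mean absolute error of a seasonal naive forecast. The notation here is chosen for illustration: $y_t$ are observed values, $\hat{y}_t$ are forecasts, $T$ is the length of the training window, $H$ is the forecast horizon, and $m$ is the seasonal period. How FEV Bench handles details such as missing values is not stated on this page.

```latex
\mathrm{MASE} =
  \frac{\frac{1}{H}\sum_{t=T+1}^{T+H} \lvert y_t - \hat{y}_t \rvert}
       {\frac{1}{T-m}\sum_{t=m+1}^{T} \lvert y_t - y_{t-m} \rvert}
```

A value below 1 means the forecast's average error is smaller than the seasonal naive forecast's average in-sample error on the same series.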