FEV Bench
FEV Bench measures how well time series foundation models forecast unseen data with no fine-tuning. It is the cleanest answer to the baseline question most teams start with: which model is strongest when you just hand it history and ask for a forecast.
What this benchmark answers
How well does a model generalize zero-shot to unseen forecasting series?
Methodology
Models are evaluated zero-shot across diverse real-world datasets and ranked by Skill Score, with Win Rate showing how consistently they beat competitors across benchmark slices.
Chronos-2 leads with Skill Score 35.50, followed by TimesFM-2.5 (30.20) and TiRex (30.01).
7 of 12 ranked models hosted on TSFM.ai · Higher is better
[Charts: Rankings · Skill Score vs Win Rate]
Full results
| # | Model | Skill Score |
|---|---|---|
| 1 | Chronos-2 | 35.50 |
| 2 | TimesFM-2.5 | 30.20 |
| 3 | TiRex | 30.01 |
| 4 | | 28.21 |
| 5 | TabPFN-TS | 27.65 |
| 6 | | 27.26 |
| 7 | | 26.52 |
| 8 | | 24.75 |
| 9 | Stat. Ensemble | 15.65 |
| 10 | AutoARIMA | 11.24 |
| 11 | AutoTheta | 10.99 |
| 12 | AutoETS | 2.26 |
Why zero-shot matters
Most production teams do not have the data, compute, or time to fine-tune a foundation model for every new series. Zero-shot performance tells you what you get out of the box — the floor you can build on. FEV Bench is the most direct measure of that floor because it evaluates models on datasets they have never seen during training.
Skill Score vs Win Rate
Skill Score measures how far a model beats the seasonal naive baseline, averaged across all datasets. Win Rate captures how often a model beats its competitors across individual benchmark slices. A model with a high Skill Score but a modest Win Rate is strong on average but inconsistent: a few large wins lift its average while it loses many head-to-head comparisons. The reverse means it wins often but by narrow margins. The best production pick usually scores well on both.
How to interpret it
- Use FEV Bench when your main question is general-purpose zero-shot accuracy.
- High Skill Score means the model beats a seasonal naive baseline by a wider margin.
- Cross-check with domain-specific benchmarks before making a production choice.
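One way to apply both metrics when shortlisting is to keep only the models that no other model matches or beats on Skill Score and Win Rate at the same time. The sketch below is an illustration under assumptions, not a procedure the benchmark prescribes: `pareto_shortlist` is a hypothetical helper, and the model names and scores are made up.

```python
def pareto_shortlist(scores):
    """Return models that are not dominated on (skill_score, win_rate).

    A model is dominated when some other model is at least as good on both
    metrics and strictly better on at least one.
    """
    shortlist = []
    for name, (skill, win) in scores.items():
        dominated = any(
            other != name
            and o_skill >= skill
            and o_win >= win
            and (o_skill > skill or o_win > win)
            for other, (o_skill, o_win) in scores.items()
        )
        if not dominated:
            shortlist.append(name)
    return shortlist


# Hypothetical (skill_score, win_rate) pairs; higher is better on both.
scores = {
    "model_a": (30.0, 80.0),
    "model_b": (28.0, 85.0),
    "model_c": (27.0, 70.0),
}

print(pareto_shortlist(scores))
```

The surviving models are the candidates worth carrying forward to a domain-specific benchmark.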
Frequently asked questions
- What is FEV Bench?
- FEV Bench is a zero-shot forecasting benchmark that evaluates time series foundation models on 29 diverse real-world datasets without any fine-tuning. It ranks models by Skill Score and Win Rate.
- How often is the FEV Bench leaderboard updated?
- TSFM.ai refreshes the leaderboard automatically every 12 hours from the official upstream source hosted on Hugging Face.
- What does Skill Score measure?
- Skill Score quantifies how much better a model performs compared to a seasonal naive baseline. Higher values indicate a larger improvement over the baseline across the full benchmark surface.
- Should I use FEV Bench to pick a model for production?
- FEV Bench is a strong starting point for general-purpose forecasting, but you should cross-check with domain-specific benchmarks like BOOM (for observability data) or GIFT-Eval (for probabilistic robustness) before committing to a production choice.
Related reading
Compare with other TSFM benchmarks
How do models behave on observability telemetry instead of academic datasets?
Does model performance hold up as real time passes and the data distribution shifts?