Live · 29 benchmark rows · Auto-refreshed from the official leaderboard every 12 hours

FEV Bench

FEV Bench measures how well time series foundation models forecast unseen data with no fine-tuning. It is the cleanest answer to the baseline question most teams start with: which model is strongest when you just hand it history and ask for a forecast.

What this benchmark answers

How well does a model generalize zero-shot to unseen forecasting series?

Methodology

Models are evaluated zero-shot across diverse real-world datasets and ranked by Skill Score, with Win Rate showing how consistently they beat competitors across benchmark slices.
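To make the ranking mechanics concrete, here is a minimal sketch of how a win rate across benchmark slices can be computed from per-slice scores. This is a generic illustration, not FEV Bench's exact tie-breaking or aggregation procedure; the `win_rates` helper and its input shape are assumptions for the example.

```python
from collections import Counter

def win_rates(scores: dict[str, list[float]]) -> dict[str, float]:
    """scores maps model name -> per-slice skill scores (higher is better).
    Returns the fraction of slices on which each model finishes first."""
    models = list(scores)
    n_slices = len(next(iter(scores.values())))
    wins = Counter()
    for i in range(n_slices):
        # Winner of this slice is the model with the highest score on it.
        winner = max(models, key=lambda m: scores[m][i])
        wins[winner] += 1
    return {m: wins[m] / n_slices for m in models}

# Toy data: model A dominates one slice by a wide margin,
# model B wins the other two slices narrowly.
print(win_rates({"A": [9.0, 1.0, 1.0], "B": [2.0, 2.0, 2.0]}))
```

Note how A could still post the higher average score while B takes the better win rate — exactly the average-versus-consistency distinction the leaderboard's two columns capture.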

Chronos-2 leads with Skill Score 35.50, followed by TimesFM-2.5 (30.20) and TiRex (30.01).

7 of 12 ranked models hosted on TSFM.ai · Higher is better

Rankings

[Scatter chart: Skill Score vs Win Rate — hosted models shown filled, non-hosted models outlined]

Full results

#    Model            Skill Score
1    Chronos-2        35.50
2    TimesFM-2.5      30.20
3    TiRex            30.01
4    —                28.21
5    TabPFN-TS        27.65
6    —                27.26
7    —                26.52
8    —                24.75
9    Stat. Ensemble   15.65
10   AutoARIMA        11.24
11   AutoTheta        10.99
12   AutoETS           2.26

Why zero-shot matters

Most production teams do not have the data, compute, or time to fine-tune a foundation model for every new series. Zero-shot performance tells you what you get out of the box — the floor you can build on. FEV Bench is the most direct measure of that floor because it evaluates models on datasets they have never seen during training.

Skill Score vs Win Rate

Skill Score measures how far a model beats the seasonal naive baseline, averaged across all datasets. Win Rate captures how often a model finishes first across individual benchmark slices. A model with a high Skill Score but a modest Win Rate is strong on average but inconsistent; the reverse means it wins often but by narrow margins. The best production pick usually scores well on both.
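The comparison against a seasonal naive baseline can be sketched in a few lines. This is a generic illustration of the idea — skill as percentage error reduction over a repeat-last-season forecast — not FEV Bench's exact metric definition; the `seasonal_naive` and `skill_score` helpers are assumptions for the example.

```python
import numpy as np

def seasonal_naive(history: np.ndarray, horizon: int, season: int) -> np.ndarray:
    """Forecast by repeating the last full season of history over the horizon."""
    last_season = history[-season:]
    reps = int(np.ceil(horizon / season))
    return np.tile(last_season, reps)[:horizon]

def skill_score(actuals: np.ndarray, model: np.ndarray, baseline: np.ndarray) -> float:
    """MAE-based skill: 100 = perfect, 0 = no better than baseline, negative = worse."""
    model_mae = float(np.mean(np.abs(actuals - model)))
    base_mae = float(np.mean(np.abs(actuals - baseline)))
    return 100.0 * (1.0 - model_mae / base_mae)

# Toy example: four weeks of a daily series with weekly seasonality (season=7).
history = np.array([10, 12, 14, 13, 11, 9, 8] * 4, dtype=float)
actuals = np.array([10.5, 12.2, 13.8, 13.1, 11.4, 9.2, 8.1])
baseline = seasonal_naive(history, horizon=7, season=7)
model = np.array([10.4, 12.1, 13.9, 13.0, 11.3, 9.1, 8.2])  # pretend model forecast
print(round(skill_score(actuals, model, baseline), 2))
```

Averaging such a per-dataset skill across the full benchmark gives an aggregate like the leaderboard's Skill Score column, while the per-dataset winners feed Win Rate.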

How to interpret it

  • Use FEV Bench when your main question is general-purpose zero-shot accuracy.
  • High Skill Score means the model beats a seasonal naive baseline by a wider margin.
  • Cross-check with domain-specific benchmarks before making a production choice.

Frequently asked questions

What is FEV Bench?
FEV Bench is a zero-shot forecasting benchmark that evaluates time series foundation models on 29 diverse real-world datasets without any fine-tuning. It ranks models by Skill Score and Win Rate.
How often is the FEV Bench leaderboard updated?
TSFM.ai refreshes the leaderboard automatically every 12 hours from the official upstream source hosted on Hugging Face.
What does Skill Score measure?
Skill Score quantifies how much better a model performs compared to a seasonal naive baseline. Higher values indicate a larger improvement over the baseline across the full benchmark surface.
Should I use FEV Bench to pick a model for production?
FEV Bench is a strong starting point for general-purpose forecasting, but you should cross-check with domain-specific benchmarks like BOOM (for observability data) or GIFT-Eval (for probabilistic robustness) before committing to a production choice.

Related reading

Compare with other TSFM benchmarks

GIFT-Eval

Which models stay strong across heterogeneous datasets and probabilistic settings?

BOOM

How do models behave on observability telemetry instead of academic datasets?

Impermanent

Does model performance hold up as real time passes and the data distribution shifts?
