Live · 96 benchmark rows · Auto-refreshed from the official leaderboard every 12 hours

FEV Bench

FEV Bench evaluates time series foundation models across 100 forecasting tasks drawn from 96 real-world datasets spanning 7 domains, including 46 tasks with covariates. Pretrained models are evaluated zero-shot under the benchmark's leakage policy; task-specific models may train on each task's training split. This leaderboard view shows the full MASE Skill Score ranking for point-forecasting accuracy.

What this benchmark answers

How well does a model generalize to unseen real-world forecasting tasks?

Methodology

Pretrained models run zero-shot under the benchmark's leakage policy. Task-specific models (classical baselines, fine-tuned models) may use each task's training split. All models are ranked by MASE-based Skill Score relative to a seasonal naive baseline. Win Rate is the percentage of head-to-head, per-task comparisons against the other models that a model wins, which is why even Seasonal Naive scores 20.00 rather than 0.

Chronos-2 leads with a MASE Skill Score of 35.50, followed by TimesFM-2.5 (30.20) and TiRex (30.01).

9 of 21 ranked models hosted on TSFM.ai · Higher is better

Rankings

[Chart: MASE Skill Score vs Win Rate. Legend: filled = hosted on TSFM.ai, outline = not hosted.]

Full results

#	Model	MASE Skill Score	Win Rate	Inference (s/100)	Train overlap	Failures	Hosted
1	Chronos-2	35.50	87.25	0.84	0%	0	Yes
2	TimesFM-2.5	30.20	76.10	1.87	10%	0	Yes
3	TiRex	30.01	77.00	0.24	1%	0	Yes
4	TabPFN-TS	29.83	63.92	109.39	0%	1	No
5	Toto-1.0	28.23	66.65	22.35	8%	0	Yes
6	FlowState	27.83	70.50	2.29	8%	0	Yes
7	Moirai-2.0	27.22	62.45	0.35	28%	0	Yes (family)
8	Chronos-Bolt	26.52	61.30	0.26	0%	0	Yes (family)
9	Sundial-Base	24.75	53.65	8.01	1%	0	Yes
10	CatBoost	23.69	52.40	0.31	0%	0	No
11	LightGBM	21.69	49.35	0.27	0%	0	No
12	TFT	20.48	45.67	0.99	0%	0	No
13	PatchTST	18.37	41.35	0.71	0%	0	Yes
14	DeepAR	17.52	38.58	1.26	0%	3	No
15	Stat. Ensemble	16.44	47.70	146.94	0%	4	No
16	AutoARIMA	11.63	36.15	20.14	0%	4	No
17	AutoTheta	10.99	33.35	3.28	0%	0	No
18	AutoETS	2.26	33.07	3.48	0%	3	No
19	Seasonal Naive	0.00	20.00	0.48	0%	0	No
20	Naive	-16.67	18.20	0.47	0%	0	No
21	Drift	-18.14	15.35	0.45	0%	0	No

What FEV Bench measures

FEV Bench runs 100 forecasting tasks from 96 datasets across 7 domains, including 46 tasks with covariates. The benchmark separates pretrained foundation models, which are evaluated zero-shot under a leakage policy, from task-specific models that may train on each task's training split. Both tracks appear in the leaderboard; the benchmark's training runtime column, which the table on this page omits, distinguishes them. Several pretrained models have declared partial training-corpus overlap with benchmark datasets (the Train overlap column); the benchmark imputes results for the affected tasks rather than excluding those models entirely.

MASE Skill Score vs Win Rate

This leaderboard ranks by MASE Skill Score: how far a model beats the seasonal naive baseline, aggregated across all 100 tasks. Win Rate is the percentage of head-to-head, per-task comparisons against the other models that a model wins; it is not a share of first-place finishes, which is why Seasonal Naive still scores 20.00. FEV Bench also publishes SQL (scaled quantile loss) for probabilistic accuracy, with WQL and WAPE as secondary metrics. A model with a high MASE Skill Score but a modest Win Rate wins by large margins on some tasks but loses many individual comparisons; the reverse means it wins often but by narrow margins.
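The difference between the two columns is easier to see with numbers. The sketch below assumes that Skill Score is one minus the geometric mean of a model's per-task error relative to the baseline (with the ratios clipped), and that Win Rate is the share of pairwise per-task comparisons a model wins, with ties counted as half a win. The clipping bounds, function names, and per-task errors are all illustrative assumptions; consult the benchmark's own documentation for the exact aggregation.

```python
import math


def skill_score(model_errors, baseline_errors, lower=1e-2, upper=100.0):
    """Skill score in percent: 100 * (1 - geometric mean of error ratios).

    `lower` and `upper` clip each per-task ratio; the values are assumptions.
    """
    ratios = [min(max(m / b, lower), upper)
              for m, b in zip(model_errors, baseline_errors)]
    geo_mean = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
    return 100.0 * (1.0 - geo_mean)


def win_rate(errors_by_model, model):
    """Percentage of (task, other model) comparisons that `model` wins."""
    wins = 0.0
    comparisons = 0
    for other, other_errors in errors_by_model.items():
        if other == model:
            continue
        for mine, theirs in zip(errors_by_model[model], other_errors):
            comparisons += 1
            if mine < theirs:
                wins += 1.0
            elif mine == theirs:
                wins += 0.5  # assumed tie handling
    return 100.0 * wins / comparisons


# Made-up per-task MASE values for four tasks (not benchmark data).
errors = {
    "seasonal_naive": [1.0, 1.0, 1.0, 1.0],
    "model_a": [0.2, 0.2, 1.1, 1.1],   # big wins on two tasks, small losses on two
    "model_b": [0.9, 0.9, 0.9, 0.9],   # small wins everywhere
}

for name, errs in errors.items():
    score = skill_score(errs, errors["seasonal_naive"])
    print(f"{name}: skill={score:.1f}, win_rate={win_rate(errors, name):.1f}")
```

In this toy setup `model_a` gets the higher skill score (about 53) but only a 50% win rate, while `model_b` has a skill score of about 10 and a 75% win rate, mirroring the two patterns described above.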

How to interpret it

- This view shows the full MASE leaderboard. FEV Bench also reports SQL (probabilistic) and WQL/WAPE metrics.
- A higher MASE Skill Score means the model beats the seasonal naive baseline by a wider margin across all 100 tasks.
- Several pretrained models have declared training-corpus overlap with benchmark datasets; affected tasks are imputed per the benchmark's leakage policy.
- Cross-check with domain-specific benchmarks before making a production choice.
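One way to apply these points is to filter the results table before comparing scores. The sketch below copies the first ten rows of the table above and keeps models under a training-overlap cap, a failure cap, and an inference-time budget. The thresholds, the `Result` type, and the `shortlist` helper are hypothetical choices for illustration, not a recommendation from the benchmark.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    skill: float          # MASE Skill Score
    win_rate: float       # Win Rate
    inference: float      # "Inference (s/100)" column, as reported
    overlap_pct: int      # "Train overlap" column, in percent
    failures: int


# First ten rows of the results table on this page.
RESULTS = [
    Result("Chronos-2", 35.50, 87.25, 0.84, 0, 0),
    Result("TimesFM-2.5", 30.20, 76.10, 1.87, 10, 0),
    Result("TiRex", 30.01, 77.00, 0.24, 1, 0),
    Result("TabPFN-TS", 29.83, 63.92, 109.39, 0, 1),
    Result("Toto-1.0", 28.23, 66.65, 22.35, 8, 0),
    Result("FlowState", 27.83, 70.50, 2.29, 8, 0),
    Result("Moirai-2.0", 27.22, 62.45, 0.35, 28, 0),
    Result("Chronos-Bolt", 26.52, 61.30, 0.26, 0, 0),
    Result("Sundial-Base", 24.75, 53.65, 8.01, 1, 0),
    Result("CatBoost", 23.69, 52.40, 0.31, 0, 0),
]


def shortlist(results, max_overlap_pct, max_failures, max_inference):
    """Keep rows within all three limits, best skill score first."""
    kept = [r for r in results
            if r.overlap_pct <= max_overlap_pct
            and r.failures <= max_failures
            and r.inference <= max_inference]
    return sorted(kept, key=lambda r: r.skill, reverse=True)


# Hypothetical limits: at most 1% overlap, no failures, inference value <= 1.0.
picks = shortlist(RESULTS, max_overlap_pct=1, max_failures=0, max_inference=1.0)
for r in picks:
    print(f"{r.model}: skill={r.skill:.2f}, inference={r.inference}")
```

Tightening or loosening the limits changes the shortlist, so the ranking you act on should reflect the constraints of your own deployment rather than the headline order alone.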

Frequently asked questions

What is FEV Bench?

FEV Bench is a forecasting benchmark of 100 tasks from 96 real-world time-series datasets across 7 domains, including covariate-rich and multivariate settings. This leaderboard view shows MASE-based Skill Score for point-forecasting accuracy. Pretrained models are evaluated under the benchmark's zero-shot/leakage policy; task-specific models may train on each task's training split.

How often is the FEV Bench leaderboard updated?

TSFM.ai refreshes the leaderboard automatically every 12 hours from the official upstream source hosted on Hugging Face.

What does MASE Skill Score measure?

Skill Score quantifies how much better a model performs compared to a seasonal naive baseline, using MASE (Mean Absolute Scaled Error) as the base metric. Higher values indicate a larger improvement over the baseline across all 100 benchmark tasks.

Should I use FEV Bench to pick a model for production?

FEV Bench is a strong starting point for general-purpose point-forecasting accuracy, but you should cross-check with domain-specific benchmarks like BOOM (for observability data) or GIFT-Eval (for probabilistic robustness) before committing to a production choice.

Title	Date
FEV Bench: The Zero-Shot Forecasting Benchmark Explained	2026-02-10
Zero-Shot Forecasting: Why It Matters	2024-04-12
The Challenges of Benchmarking TSFMs	2024-11-05