Benchmarks

Live snapshots of two leading benchmarks: FEV Bench and GIFT-Eval.

This page is auto-fed from public leaderboard sources so it stays current without manual edits. We normalize model names to the TSFM.ai hosted catalog and link rows to model detail pages when there is a verified match.

Hosted models indexed: 22
Refresh policy: periodic server revalidation

How To Interpret These Benchmarks

Each benchmark captures different strengths. Use both before deciding your default routing policy.

FEV Bench

What it tests: Zero-shot forecasting quality across diverse real-world datasets, without per-dataset fine-tuning.

How to read: Start with Skill Score (higher is better), then check Win Rate to confirm consistency across datasets.

Where it helps: Best for choosing a strong default model for broad, mixed-domain production traffic.

GIFT-Eval

What it tests: Probabilistic forecasting robustness across dataset-frequency slices in both univariate and multivariate settings.

How to read: Start with Average Rank (lower is better), then compare Average WQL for distributional calibration quality.

Where it helps: Best for uncertainty-sensitive workloads where quantile quality and calibration matter.

FEV Bench

AutoGluon's Foundation model EValuation benchmark for zero-shot time series forecasting.

Source: leaderboard

Raw: csv

FEV Bench evaluates how well foundation models forecast unseen time series without any fine-tuning. It tests models on a diverse collection of real-world datasets spanning retail, energy, finance, traffic, and nature domains at multiple frequencies (minutely through yearly). Models are ranked by Skill Score, a composite metric that measures accuracy relative to a naive seasonal baseline across all datasets. Win Rate tracks how often a model beats the majority of competitors on individual datasets, capturing consistency alongside average performance.

Primary metric: Skill Score (higher is better)
Datasets: 29
Secondary metric: Win Rate
Methodology

Each model receives raw historical series and must produce probabilistic forecasts at the specified horizon with no training, fine-tuning, or dataset-specific hyperparameter selection. Results are aggregated using MASE (Mean Absolute Scaled Error) normalized against a seasonal naive baseline, then converted to a composite Skill Score.
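To make the aggregation concrete, here is a minimal sketch of MASE against a seasonal naive baseline, with the skill score illustrated as 1 − MASE_model / MASE_baseline. This is not the official fev pipeline; the toy series, seasonal period, and the exact composite aggregation across the 29 datasets are assumptions for illustration.

```python
# Hedged sketch, not the official fev implementation: MASE scaled by a
# seasonal naive baseline, with skill = 1 - MASE ratio (the leaderboard's
# exact composite aggregation may differ).
import numpy as np

def mase(y_true, y_pred, y_train, m):
    # MAE of the forecast, scaled by the in-sample MAE of a
    # one-season-ahead (lag-m) naive forecast on the training data
    scale = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(y_true - y_pred)) / scale

m = 4  # seasonal period of this toy series
train = np.array([10.0, 12.0, 11.0, 13.0, 10.5, 12.5, 11.5, 13.5])
actual = np.array([10.0, 12.0])
model_forecast = np.array([10.2, 11.9])
naive_forecast = train[-m:-m + 2]  # repeat the last observed season

model_mase = mase(actual, model_forecast, train, m)
baseline_mase = mase(actual, naive_forecast, train, m)
skill = 1.0 - model_mase / baseline_mase  # positive = beats seasonal naive
```

A skill near 1 means the model's scaled error is far below the seasonal naive baseline; a skill at or below 0 means it adds no value over the baseline.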

| Rank | Model | Skill Score | Win Rate | Hosted |
| --- | --- | --- | --- | --- |
| 1 | Chronos-2 | 35.496 | 88.071 | View model |
| 2 | TimesFM-2.5 | 30.198 | 75.071 | View model |
| 3 | TiRex | 30.012 | 76.893 | Not hosted |
| 4 | Toto-1.0 | 28.214 | 66.857 | View model |
| 5 | TabPFN-TS | 27.650 | 58.679 | Not hosted |
| 6 | Moirai-2.0 | 27.259 | 61.214 | View model |
| 7 | Chronos-Bolt | 26.515 | 60.786 | Not hosted |
| 8 | Sundial-Base | 24.746 | 53.393 | View model |
| 9 | Stat. Ensemble | 15.654 | 48.536 | Not hosted |
| 10 | AutoARIMA | 11.240 | 36.679 | Not hosted |
| 11 | AutoTheta | 10.988 | 34.929 | Not hosted |
| 12 | AutoETS | 2.259 | 33.250 | Not hosted |

GIFT-Eval

Salesforce's General Time Series Forecasting Model Evaluation benchmark covering 23 datasets across 7 domains.

Source: leaderboard

Raw: csv

GIFT-Eval is a comprehensive benchmark designed to test foundation models on both univariate and multivariate forecasting tasks. It spans 23 datasets across 7 domains including energy, transport, nature, economics, web traffic, healthcare, and sales. Models are evaluated on probabilistic accuracy using Weighted Quantile Loss (WQL) and ranked across all dataset-frequency combinations. The Average Rank metric provides a robust, outlier-resistant measure of overall model quality — a model that consistently places in the top 3 across all datasets will have a lower average rank than one that dominates a few datasets but performs poorly on others.
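The average-rank aggregation described above can be sketched in a few lines. The WQL values below are hypothetical and the ranking code is an illustration of the idea (rank per slice, then mean across slices), not the official GIFT-Eval pipeline.

```python
# Hedged sketch of average-rank aggregation with hypothetical WQL values
# (lower WQL is better); not the official GIFT-Eval evaluation code.
import numpy as np

# rows: models, columns: dataset-frequency slices
wql = np.array([
    [0.50, 0.60, 0.55],  # model A: consistently near the top
    [0.52, 0.58, 0.70],  # model B
    [0.70, 0.65, 0.52],  # model C: wins one slice, weak elsewhere
])

# rank within each slice (1 = best = lowest WQL; assumes no ties)
ranks = wql.argsort(axis=0).argsort(axis=0) + 1
avg_rank = ranks.mean(axis=1)
```

Model A never wins a slice outright here but places first or second everywhere, so it ends up with the lowest average rank, which is exactly the consistency-rewarding behavior the metric is designed for.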

Primary metric: Average Rank (lower is better)
Datasets: 23
Secondary metric: Average WQL
Methodology

Models produce quantile forecasts at standard levels (0.1, 0.2, ..., 0.9) for each dataset. WQL measures how well predicted quantiles match the true distribution. Results are grouped by univariate vs multivariate settings and aggregated across all dataset-frequency pairs. The final ranking uses mean rank across all evaluation slices.
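The quantile scoring above can be sketched as pinball loss summed over the nine standard levels. The normalization shown (scaled by the sum of absolute actuals, a common WQL convention) and the toy forecasts are assumptions; the leaderboard's exact weighting may differ.

```python
# Hedged sketch of weighted quantile loss at levels 0.1..0.9; the
# normalization convention is an assumption, not GIFT-Eval's exact code.
import numpy as np

def quantile_loss(y, q_pred, q):
    # pinball loss for quantile level q
    diff = y - q_pred
    return np.maximum(q * diff, (q - 1) * diff)

def wql(y, quantile_preds, levels):
    # sum pinball losses over all levels, scale by sum of |actuals|,
    # average over the number of quantile levels
    total = sum(quantile_loss(y, quantile_preds[q], q).sum() for q in levels)
    return 2 * total / (len(levels) * np.abs(y).sum())

levels = [round(0.1 * k, 1) for k in range(1, 10)]
y = np.array([100.0, 120.0, 110.0])
# hypothetical quantile forecasts: a flat +/-8 spread around a point forecast
preds = {q: np.array([95.0, 115.0, 105.0]) + 20 * (q - 0.5) for q in levels}
score = wql(y, preds, levels)
```

Lower WQL is better: it penalizes both miscentered point forecasts (via the median level) and poorly calibrated spread (via the outer levels).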

| Rank | Model | Average Rank | Average WQL | Hosted |
| --- | --- | --- | --- | --- |
| 1 | patch_tst | 5.651 | 0.559 | Not hosted |
| 2 | moirai_1.1_R_large_no_leak | 6.059 | 0.597 | Not hosted |
| 3 | i_transformer | 6.250 | 0.625 | Not hosted |
| 4 | tft | 6.778 | 0.597 | Not hosted |
| 5 | moirai_1.1_R_base_no_leak | 7.436 | 0.646 | Not hosted |
| 6 | moirai_1.1_R_small_no_leak | 7.663 | 0.641 | Not hosted |
| 7 | chronos_base | 8.434 | 0.642 | View model |
| 8 | chronos_large | 8.436 | 0.634 | View model |
| 9 | timesfm | 8.702 | 0.690 | View model |
| 10 | chronos-small | 9.200 | 0.641 | View model |
| 11 | tide | 11.550 | 0.771 | Not hosted |
| 12 | deepar | 12.000 | 0.990 | Not hosted |