TSFM benchmark tracker

Independent leaderboards for time series foundation models, refreshed automatically from upstream sources. Each benchmark answers a different forecasting question — pick the one that matches your workload.

Use this hub to compare TSFM benchmarks at a glance, then drill into benchmark-specific pages for methodology, leaderboard interpretation, hosted-model coverage, and related research.

Live · Zero-shot generalization

Zero-shot leaderboard for forecasting on broad real-world time series.

1. Chronos-2 · 35.50
2. TimesFM-2.5 · 30.20
3. TiRex · 30.01
4. Toto-1.0 · 28.21

Skill Score · higher is better

Live · Probabilistic and multi-dataset robustness

Broad benchmark for probabilistic forecasting across diverse datasets and settings.

1. patch_tst · 5.65
2. moirai_1.1_R_large_no_leak · 6.06
3. i_transformer · 6.25
4. tft · 6.78

Average Rank · lower is better

Live · Infrastructure and observability forecasting

Observability benchmark for production telemetry and infrastructure metrics.

1. Toto-Open-Base-1.0 · 0.38
2. moirai_1.1_base · 0.43
3. moirai_1.1_large · 0.44
4. moirai_1.1_small · 0.44

CRPS · lower is better

How to use these benchmark pages

Each benchmark answers a different model-selection question. FEV Bench is the starting point for zero-shot forecasting, GIFT-Eval is the right check for probabilistic robustness across heterogeneous datasets, BOOM is the benchmark for observability-style telemetry, and Impermanent focuses on temporal generalization under live drift.

The benchmark-specific pages on TSFM.ai keep the same leaderboard design you see here, but add benchmark methodology, interpretation guidance, related blog posts, and direct links into hosted model surfaces where matching coverage exists.

Which benchmark should you use?

FEV Bench · Skill Score

How well does a model generalize zero-shot to unseen forecasting series?

Models are evaluated zero-shot across diverse real-world datasets and ranked by Skill Score, with Win Rate showing how consistently they beat competitors across benchmark slices.
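To make the two headline numbers concrete, here is a minimal sketch of how a skill score and a win rate are commonly computed from per-task errors. The exact FEV Bench formulas may differ; the function names, baseline choice, and error inputs below are illustrative assumptions, not the benchmark's published implementation.

```python
from statistics import mean

def skill_score(model_errors, baseline_errors):
    """Common skill-score form: 1 - err_model / err_baseline per task,
    averaged across tasks. Positive means the model beats the baseline
    on average. (Illustrative only; FEV Bench's exact formula may differ.)"""
    return mean(1 - m / b for m, b in zip(model_errors, baseline_errors))

def win_rate(model_errors, competitor_errors):
    """Fraction of benchmark slices where the model's error is strictly
    lower than the competitor's on the same slice."""
    wins = sum(m < c for m, c in zip(model_errors, competitor_errors))
    return wins / len(model_errors)
```

Under this reading, a skill score of 35.50 (as a percentage) means roughly 35% lower error than the baseline on average, while win rate captures consistency rather than margin.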

GIFT-Eval · Average Rank

Which models stay strong across heterogeneous datasets and probabilistic settings?

Models are scored on grouped benchmark slices and ranked by average rank, with Weighted Quantile Loss providing a secondary read on probabilistic accuracy.
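Average rank is a simple aggregation: rank the models within each benchmark slice by their score, then average each model's rank across slices. A minimal sketch follows; the slice names, model names, and scores are made up for illustration, and GIFT-Eval's grouping and tie-handling may differ.

```python
from statistics import mean

# Hypothetical per-slice scores (lower is better, e.g. Weighted Quantile Loss).
# All names and values here are illustrative, not GIFT-Eval data.
scores = {
    "electricity": {"model_a": 0.10, "model_b": 0.12, "model_c": 0.15},
    "traffic":     {"model_a": 0.30, "model_b": 0.25, "model_c": 0.28},
    "retail":      {"model_a": 0.19, "model_b": 0.22, "model_c": 0.20},
}

def average_rank(scores):
    """Rank models within each slice (1 = best, lower score wins),
    then average each model's rank across all slices."""
    ranks = {model: [] for model in next(iter(scores.values()))}
    for slice_scores in scores.values():
        ordered = sorted(slice_scores, key=slice_scores.get)
        for rank, model in enumerate(ordered, start=1):
            ranks[model].append(rank)
    return {model: mean(r) for model, r in ranks.items()}
```

Because ranks are bounded per slice, this aggregation rewards models that are consistently decent everywhere over models that excel on a few slices and collapse on others.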

BOOM · CRPS

How do models behave on observability telemetry instead of academic datasets?

Models are ranked on real observability time series using CRPS for probabilistic quality and MASE for point accuracy, aggregated across production telemetry collected from real infrastructure workloads.
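For readers unfamiliar with the two metrics, here is a sample-based sketch of CRPS and MASE for a single series. These are standard estimator forms, not BOOM's exact evaluation code, and the function signatures are assumptions for illustration.

```python
import numpy as np

def crps_from_samples(samples, y):
    """Empirical CRPS for one observation y, given forecast samples.
    Uses the standard estimator CRPS = E|X - y| - 0.5 * E|X - X'|.
    Lower is better; 0 means a point forecast exactly on the truth."""
    samples = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(samples - y))
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term1 - term2

def mase(forecast, actual, insample, season=1):
    """Mean Absolute Scaled Error: MAE of the forecast divided by the
    MAE of a (seasonal) naive forecast on the in-sample history.
    Values below 1 beat the naive baseline."""
    forecast, actual, insample = (np.asarray(a, dtype=float)
                                  for a in (forecast, actual, insample))
    scale = np.mean(np.abs(insample[season:] - insample[:-season]))
    return np.mean(np.abs(forecast - actual)) / scale
```

CRPS scores the whole predictive distribution (it penalizes both bias and miscalibrated spread), which is why BOOM pairs it with MASE as a scale-free point-accuracy check.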

Frequently asked questions

What are TSFM benchmarks?
TSFM benchmarks are benchmark-specific pages on TSFM.ai that track how time series foundation models perform across zero-shot, probabilistic, observability, and temporal generalization workloads.
Which TSFM benchmark should I use first?
Start with FEV Bench for general zero-shot model selection, then cross-check GIFT-Eval for probabilistic robustness, BOOM for observability-style data, and Impermanent for temporal drift and live evaluation.
Do the TSFM benchmark pages include hosted model coverage?
Yes. Each benchmark page highlights which leaderboard entries map to models currently hosted on TSFM.ai and links out to the matching model surface when available.
How often are benchmark pages refreshed?
Live benchmark pages refresh automatically from their upstream leaderboard sources every 12 hours. Impermanent remains visible as an emerging benchmark while its machine-readable public feed stabilizes.

Related benchmark reading