Benchmarks
Independent leaderboards for time series foundation models (TSFMs), refreshed automatically from upstream sources. Each benchmark answers a different forecasting question, so pick the one that matches your workload.
Use this hub to compare TSFM benchmarks at a glance, then drill into benchmark-specific pages for methodology, leaderboard interpretation, hosted-model coverage, and related research.
FEV Bench: Zero-shot leaderboard for forecasting on broad real-world time series.
Skill Score · higher is better
GIFT-Eval: Broad benchmark for probabilistic forecasting across diverse datasets and settings.
Average Rank · lower is better
BOOM: Observability benchmark for production telemetry and infrastructure metrics.
CRPS · lower is better
Each benchmark answers a different model-selection question. FEV Bench is the starting point for zero-shot forecasting. GIFT-Eval is the right check for probabilistic robustness across heterogeneous datasets. BOOM is the benchmark for observability-style telemetry, and Impermanent focuses on temporal generalization under live drift.
The benchmark-specific pages on TSFM.ai keep the same leaderboard design you see here, but add benchmark methodology, interpretation guidance, related blog posts, and direct links into hosted model surfaces where matching coverage exists.
How well does a model generalize zero-shot to unseen forecasting series?
Models are evaluated zero-shot across diverse real-world datasets and ranked by Skill Score, with Win Rate showing how consistently they beat competitors across benchmark slices.
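As a rough illustration of the two numbers mentioned above, the sketch below computes a skill score and a win rate from per-task errors. It assumes the skill score is one minus the geometric mean of the model's error relative to a baseline, and that the win rate is the share of tasks where the model has the lower error, with ties counted as half. The function names and the toy error values are hypothetical, and the benchmark's own aggregation may differ in detail.

```python
import math

def skill_score(model_errors, baseline_errors):
    """1 - geometric mean of per-task error ratios (model / baseline).

    Assumes all errors are strictly positive. Positive values mean the
    model beats the baseline on average; higher is better.
    """
    ratios = [m / b for m, b in zip(model_errors, baseline_errors)]
    log_mean = sum(math.log(r) for r in ratios) / len(ratios)
    return 1.0 - math.exp(log_mean)

def win_rate(model_errors, rival_errors):
    """Share of tasks where the model's error is lower; ties count half."""
    wins = 0.0
    for m, r in zip(model_errors, rival_errors):
        if m < r:
            wins += 1.0
        elif m == r:
            wins += 0.5
    return wins / len(model_errors)

# Hypothetical per-task errors for a baseline and one model.
baseline = [1.0, 2.0, 4.0]
model = [0.5, 1.0, 4.0]

print(round(skill_score(model, baseline), 4))
print(round(win_rate(model, baseline), 4))
```

The geometric mean keeps a single task with a very large error ratio from dominating the aggregate, which is one reason relative-error summaries are often built this way.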
Which models stay strong across heterogeneous datasets and probabilistic settings?
Models are scored on grouped benchmark slices and ordered by Average Rank, with Weighted Quantile Loss providing a secondary read on probabilistic accuracy.
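To make the two metrics concrete, here is a minimal sketch of an average rank across slices and a weighted quantile loss. It assumes lower scores are better within each slice, that every model appears in every slice, and that the weighted quantile loss is twice the summed pinball loss divided by the summed absolute target values, averaged over quantile levels. The function names and inputs are hypothetical, and the benchmark's exact grouping and tie handling may differ.

```python
def pinball_loss(y, q_pred, q):
    """Quantile (pinball) loss for one observation at quantile level q."""
    diff = y - q_pred
    return max(q * diff, (q - 1) * diff)

def weighted_quantile_loss(y_true, quantile_preds):
    """Average over quantile levels of 2 * sum(pinball) / sum(|y|).

    quantile_preds maps a quantile level to a list of predictions
    aligned with y_true.
    """
    scale = sum(abs(y) for y in y_true)
    per_level = []
    for q, preds in quantile_preds.items():
        total = sum(pinball_loss(y, p, q) for y, p in zip(y_true, preds))
        per_level.append(2.0 * total / scale)
    return sum(per_level) / len(per_level)

def average_rank(scores_by_slice):
    """Mean rank of each model across slices (rank 1 = lowest score).

    scores_by_slice is a list of dicts mapping model name to score.
    Ties are broken by sort order in this sketch rather than averaged.
    """
    totals = {}
    for scores in scores_by_slice:
        ordered = sorted(scores, key=scores.get)
        for rank, model in enumerate(ordered, start=1):
            totals[model] = totals.get(model, 0) + rank
    return {m: t / len(scores_by_slice) for m, t in totals.items()}

# Hypothetical scores for three models on two slices.
slices = [
    {"a": 0.1, "b": 0.2, "c": 0.3},
    {"a": 0.2, "b": 0.1, "c": 0.3},
]
print(average_rank(slices))
print(weighted_quantile_loss([10.0, 20.0], {0.5: [8.0, 22.0]}))
```

Ranking within each slice before averaging means a model cannot climb the table on the strength of one easy dataset, which is why a rank-based summary pairs well with a loss-based secondary metric.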
How do models behave on observability telemetry instead of academic datasets?
Models are ranked on observability time series collected from real infrastructure workloads, using CRPS for probabilistic quality and MASE for point accuracy, with both metrics aggregated across the production telemetry.
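For readers unfamiliar with these two metrics, the sketch below shows one common way to compute each. It assumes CRPS is estimated from forecast samples using the energy form (mean absolute error to the observation minus half the mean absolute difference between sample pairs) and that MASE scales the forecast's mean absolute error by the in-sample error of a seasonal naive forecast. The function names and toy values are hypothetical, and the benchmark may estimate CRPS differently, for example from quantile forecasts.

```python
def crps_from_samples(samples, y):
    """Sample-based CRPS estimate for one observation; lower is better."""
    n = len(samples)
    # Average distance between forecast samples and the observed value.
    term1 = sum(abs(x - y) for x in samples) / n
    # Average distance between all pairs of forecast samples (spread).
    term2 = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return term1 - 0.5 * term2

def mase(y_true, y_pred, history, season=1):
    """Forecast MAE divided by the in-sample seasonal naive MAE.

    Assumes the history is longer than the season and not constant,
    so the denominator is non-zero.
    """
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    naive_errors = [
        abs(history[i] - history[i - season])
        for i in range(season, len(history))
    ]
    return mae / (sum(naive_errors) / len(naive_errors))

# Hypothetical forecast samples and a short history.
print(crps_from_samples([1.0, 3.0], 2.0))
print(mase([5.0, 6.0], [5.0, 7.0], history=[1.0, 2.0, 3.0, 4.0]))
```

A MASE below 1 means the forecast beats the naive baseline on that series, which makes the metric comparable across telemetry with very different scales.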
2026-02-10
2024-04-12
2024-11-05
2026-02-15
2025-05-08
2026-02-20