Benchmarks

TSFM benchmark archive

Independent leaderboards for time series foundation models, refreshed automatically from upstream sources. Each benchmark answers a different forecasting question, so pick the one that matches your workload.

Which benchmark should you use?

FEV Bench (Skill Score)

How well does a model generalize zero-shot to unseen forecasting series?

Models are evaluated zero-shot across diverse real-world datasets and ranked by Skill Score, with Win Rate showing how consistently they beat competitors across benchmark slices.
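As a rough illustration of the two metrics named above, the sketch below assumes that Skill Score is one minus the geometric mean of a model's per-task error relative to a baseline, and that Win Rate is the share of pairwise (task, opponent) comparisons a model wins. Both definitions, and the function names `skill_score` and `win_rate`, are assumptions made for this example; the benchmark's own documentation is the authority on the exact formulas.

```python
import math


def skill_score(model_errors, baseline_errors):
    """1 - geometric mean of per-task error ratios (model / baseline).

    Assumes all errors are strictly positive. 0 means "as good as the
    baseline", positive values mean better than the baseline.
    """
    ratios = [m / b for m, b in zip(model_errors, baseline_errors)]
    mean_log_ratio = sum(math.log(r) for r in ratios) / len(ratios)
    return 1.0 - math.exp(mean_log_ratio)


def win_rate(errors_by_model, model):
    """Fraction of (task, opponent) pairs where `model` has the lower error.

    Ties count as half a win (an assumption of this sketch).
    """
    target = errors_by_model[model]
    wins, total = 0.0, 0
    for other, errors in errors_by_model.items():
        if other == model:
            continue
        for mine, theirs in zip(target, errors):
            total += 1
            if mine < theirs:
                wins += 1.0
            elif mine == theirs:
                wins += 0.5
    return wins / total
```

Using a geometric mean keeps one very easy or very hard task from dominating the aggregate, which is why it is a common choice when errors live on different scales across datasets.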

GIFT-Eval (Average Rank)

Which models stay strong across heterogeneous datasets and probabilistic settings?

Models are scored on grouped benchmark slices and ranked by average rank, with Weighted Quantile Loss providing a secondary read on probabilistic accuracy.
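The two quantities above can be sketched as follows. This is a minimal illustration rather than the benchmark's implementation: it assumes lower scores are better, that tied models share the average of their ranks, and that Weighted Quantile Loss is twice the summed pinball loss divided by the summed absolute target, averaged over quantile levels. The function names are hypothetical.

```python
def average_rank(scores_by_model):
    """Mean rank of each model across slices (rank 1 = lowest score)."""
    models = list(scores_by_model)
    n_slices = len(next(iter(scores_by_model.values())))
    totals = {m: 0.0 for m in models}
    for i in range(n_slices):
        column = {m: scores_by_model[m][i] for m in models}
        for m in models:
            better = sum(1 for o in models if column[o] < column[m])
            tied = sum(1 for o in models if column[o] == column[m])  # includes m
            totals[m] += better + (tied + 1) / 2  # average rank within a tie
    return {m: totals[m] / n_slices for m in models}


def weighted_quantile_loss(y_true, quantile_forecasts):
    """Mean over quantile levels of 2 * pinball loss / sum(|y|).

    `quantile_forecasts` maps a quantile level q in (0, 1) to a list of
    forecasts aligned with `y_true`.
    """
    denom = sum(abs(y) for y in y_true)
    per_level = []
    for q, preds in quantile_forecasts.items():
        pinball = sum(
            max(q * (y - p), (q - 1) * (y - p)) for y, p in zip(y_true, preds)
        )
        per_level.append(2 * pinball / denom)
    return sum(per_level) / len(per_level)
```

Ranking per slice before averaging makes the headline number insensitive to the scale of any single dataset, while the quantile loss still reports how far off the predicted distribution was.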

BOOM (CRPS)

How do models behave on observability telemetry instead of academic datasets?

Models are ranked on real observability time series using CRPS for probabilistic quality and MASE for point accuracy, aggregated across production telemetry collected from real infrastructure workloads.
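For readers unfamiliar with the two metrics, here is a small sketch of how each can be computed for a single series. It assumes the forecast distribution is available as a set of samples (so CRPS is estimated with the standard sample-based formula) and that MASE is scaled by the in-sample seasonal naive error. How BOOM aggregates these values across series is not covered here, and the function names are illustrative only.

```python
def crps_from_samples(samples, y):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|."""
    n = len(samples)
    spread_to_truth = sum(abs(s - y) for s in samples) / n
    spread_within = sum(abs(a - b) for a in samples for b in samples) / (2 * n * n)
    return spread_to_truth - spread_within


def mase(y_true, y_pred, y_train, season=1):
    """Forecast MAE divided by the in-sample MAE of a seasonal naive forecast."""
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    naive_errors = [
        abs(y_train[i] - y_train[i - season]) for i in range(season, len(y_train))
    ]
    scale = sum(naive_errors) / len(naive_errors)
    return mae / scale
```

With a single sample the CRPS estimate reduces to the absolute error, which is a quick sanity check; a MASE below 1 means the forecast beat the naive baseline on the training scale.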

Impermanent (Scaled CRPS / MASE)

Does model performance hold up as real time passes and the data distribution shifts?

The benchmark uses a prequential protocol: models forecast before outcomes exist, scores accumulate over time, and rankings reflect sustained performance under live temporal change rather than one-off wins on a frozen split.
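The prequential idea described above can be sketched in a few lines. The class below is a hypothetical toy, not the benchmark's code: it uses point forecasts and absolute error for brevity, rejects any forecast submitted at or after its target time, and ranks models by mean accumulated error.

```python
class PrequentialLeaderboard:
    """Toy prequential evaluator: forecast first, score once the outcome arrives."""

    def __init__(self):
        self.pending = {}  # (model, target_time) -> forecast awaiting an outcome
        self.errors = {}   # model -> list of absolute errors accumulated so far

    def submit(self, model, target_time, forecast, now):
        # A forecast only counts if it is made strictly before the outcome exists.
        if now >= target_time:
            raise ValueError("forecast must be submitted before the target time")
        self.pending[(model, target_time)] = forecast

    def resolve(self, target_time, outcome):
        # Score every pending forecast for this target time, then discard it.
        for model, t in list(self.pending):
            if t == target_time:
                forecast = self.pending.pop((model, t))
                self.errors.setdefault(model, []).append(abs(forecast - outcome))

    def ranking(self):
        # Best (lowest mean error) first, over everything scored so far.
        means = {m: sum(e) / len(e) for m, e in self.errors.items()}
        return sorted(means, key=means.get)
```

Because scores only ever accumulate after the fact, a model cannot benefit from having seen the evaluation data, and a single lucky window is diluted as more outcomes arrive.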

ARFBench (Accuracy)

Which multimodal models can reason about anomalies, timing, magnitude, and cross-series structure in production telemetry?

750 multiple-choice QA pairs built from 142 real Datadog observability time series and 63 production incidents, spanning eight question categories grouped into three difficulty tiers. Models receive a templated question, a metric description, and the plotted time series, and are scored on Accuracy and macro-F1 against expert-reviewed answers.
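To make the two scores concrete, the sketch below computes Accuracy and macro-F1 for a list of multiple-choice answers. It assumes macro-F1 is the unweighted mean of per-label F1 over every answer label that appears in either the reference or the predictions; whether ARFBench averages over answer options or over question categories is not stated here, so treat that choice as an assumption of this example.

```python
def accuracy(y_true, y_pred):
    """Share of questions answered correctly."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 over all labels seen in truth or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2TP / (2TP + FP + FN); the denominator is never zero because
        # every label occurs at least once in y_true or y_pred.
        f1_scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(f1_scores) / len(f1_scores)
```

Reporting macro-F1 next to Accuracy guards against a model that scores well simply by favoring the most common answer option.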

Frequently asked questions

What are TSFM benchmarks?
TSFM benchmarks are benchmark-specific pages on TSFM.ai that track how time series foundation models perform across zero-shot, probabilistic, observability, and temporal generalization workloads.

Which TSFM benchmark should I use first?
Start with FEV Bench for general zero-shot model selection, then cross-check GIFT-Eval for probabilistic robustness, BOOM for observability-style data, and Impermanent for temporal drift and live evaluation.

Do the TSFM benchmark pages include hosted model coverage?
Yes. Each benchmark page highlights which leaderboard entries map to models currently hosted on TSFM.ai and links out to the matching model surface when available.

How often are benchmark pages refreshed?
Live benchmark pages refresh automatically from their upstream leaderboard sources every 12 hours. Impermanent remains visible as an emerging benchmark while its machine-readable public feed stabilizes.

What is live-bench?
live-bench is TSFM.ai's continuously running, leakage-free benchmark. It pushes every hosted TSFM through the inference service every hour against fresh sensor data, so rankings always reflect models on data they have never seen.
