Evaluation & benchmarks
Understand how benchmark pages are sourced, interpreted, and used to compare models fairly.
Why these benchmark pages exist
Each benchmark answers a different practical question. FEV Bench measures general zero-shot accuracy, GIFT-Eval captures broader probabilistic robustness, BOOM targets observability workloads, and Impermanent extends coverage to live temporal generalization.
Sources
| Benchmark | Scope | Ranking metric | Links |
|---|---|---|---|
| FEV Bench | Broad realistic forecasting benchmark with open leaderboard tables. | Primary ranking on Skill Score (higher is better). | Hugging Face Space: autogluon/fev-bench Data file: |
| GIFT-Eval | General time series benchmark covering diverse datasets and frequencies. | Primary ranking on aggregated average rank (lower is better). | Hugging Face Space: Salesforce/GIFT-Eval Data file: |
| BOOM | Observability and infrastructure telemetry benchmark with production monitoring characteristics. | Primary ranking on CRPS (lower is better). | Hugging Face Space: Datadog/BOOM Data file: |
| Impermanent | Live temporal-generalization benchmark on continuously updated GitHub activity streams. | Tracked with scaled CRPS and MASE in a prequential evaluation loop. | TimeCopilot Impermanent project surface Data file: |
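Impermanent's prequential loop forecasts the next window, scores it against the observations that arrive, then rolls forward. A minimal sketch of that test-then-train loop with a MASE score is below; the function names and the naive stand-in forecaster are illustrative, not the benchmark's actual harness.

```python
import numpy as np

def mase(y_true, y_pred, y_train, season=1):
    """Mean Absolute Scaled Error: forecast MAE scaled by the
    in-sample seasonal-naive MAE (lower is better)."""
    mae = np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))
    naive_mae = np.mean(np.abs(y_train[season:] - y_train[:-season]))
    return mae / naive_mae

def prequential_eval(series, model, horizon=1, min_train=24):
    """Test-then-train loop: forecast the next `horizon` points,
    score them once observed, then extend the training window."""
    scores = []
    for t in range(min_train, len(series) - horizon + 1, horizon):
        train, actual = series[:t], series[t:t + horizon]
        pred = model(train, horizon)
        scores.append(mase(actual, pred, train))
    return float(np.mean(scores))

# naive last-value forecaster as a stand-in model
naive = lambda train, h: np.repeat(train[-1], h)
```

On a steadily trending series the naive forecaster scores exactly 1.0, which is the usual MASE baseline interpretation: below 1 beats seasonal-naive, above 1 does not.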
How ranking is computed on TSFM.ai
1. Fetch upstream benchmark source material from each official provider.
2. Parse score columns and build ranked rows for each benchmark when a stable machine-readable feed exists.
3. Resolve leaderboard model names to hosted catalog IDs using normalized name matching and aliases.
4. Render benchmark detail pages with model deep-links, benchmark-specific metadata, and sitemap coverage.
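The name-resolution step above can be sketched roughly as follows. The alias table, catalog IDs, and function names here are hypothetical stand-ins; the real mappings live in the TSFM.ai catalog.

```python
import re

# hypothetical alias table mapping normalized leaderboard names to catalog IDs
ALIASES = {
    "chronos-bolt-base": "chronos_bolt_base",
}

def normalize(name: str) -> str:
    """Lowercase and collapse punctuation/whitespace so variants like
    'Chronos-Bolt (Base)' and 'chronos bolt base' share one key."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")

def resolve(leaderboard_name, hosted_ids):
    """Map a leaderboard model name to a hosted catalog ID, or None
    when the model is not hosted (rendered as `Not hosted`)."""
    key = normalize(leaderboard_name)
    if key in hosted_ids:
        return key
    return ALIASES.get(key)  # None -> row shows `Not hosted`
```

Resolved IDs feed the `/models/{id}` deep-links; unresolved rows fall through to the `Not hosted` state rather than being dropped.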
Operational notes
- The benchmark hub and benchmark detail routes render server-side so benchmark terms have dedicated indexable URLs.
- Rows are normalized against TSFM.ai hosted model IDs so each supported model can deep-link into /models/{id}.
- Not every leaderboard model is hosted; those rows intentionally show `Not hosted`.
- Benchmark providers can change schemas over time. Keep parsing logic versioned and monitored.
- Impermanent is served from a deliberate fallback until its public leaderboard feed is stable enough to ingest automatically.
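Keeping parsing logic versioned, as the notes recommend, might look like the sketch below: try known schema versions newest-first and fail loudly so monitoring catches an upstream change instead of silently dropping rows. The schemas and parser names are invented for illustration.

```python
import json

# hypothetical parsers, one per known upstream feed schema
def parse_v1(payload):
    return [{"model": r["name"], "score": r["skill"]} for r in payload["rows"]]

def parse_v2(payload):
    return [{"model": r["model_id"], "score": r["metrics"]["skill_score"]}
            for r in payload["leaderboard"]]

PARSERS = [("v2", parse_v2), ("v1", parse_v1)]  # newest schema first

def parse_feed(raw):
    """Try each known schema version in order; raise when none match
    so an upstream schema change pages someone instead of passing quietly."""
    payload = json.loads(raw)
    for version, parser in PARSERS:
        try:
            return version, parser(payload)
        except (KeyError, TypeError):
            continue
    raise ValueError("unrecognized benchmark feed schema")
```

A feed that matches an older schema still parses, and the returned version tag can be logged to track when providers migrate.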