Evaluation and benchmarks

Understand how benchmark pages are sourced, interpreted, and used to compare models fairly.

Why these benchmark pages exist

Each benchmark answers a different practical question. FEV Bench measures general zero-shot forecasting accuracy, GIFT-Eval captures probabilistic robustness across diverse datasets and frequencies, BOOM targets observability workloads, and Impermanent extends coverage to live temporal generalization.

Sources

| Benchmark | Scope | Ranking metric | Links |
| --- | --- | --- | --- |
| FEV Bench | Broad realistic forecasting benchmark with open leaderboard tables. | Primary ranking on Skill Score (higher is better). | Hugging Face Space: `autogluon/fev-bench`; data file: `tables/full/leaderboard_MASE.csv` (Leaderboard, Raw CSV) |
| GIFT-Eval | General time series benchmark covering diverse datasets and frequencies. | Primary ranking on aggregated average rank (lower is better). | Hugging Face Space: `Salesforce/GIFT-Eval`; data file: `results/grouped_results_by_univariate.csv` (Leaderboard, Raw CSV) |
| BOOM | Observability and infrastructure telemetry benchmark with production monitoring characteristics. | Primary ranking on CRPS (lower is better). | Hugging Face Space: `Datadog/BOOM`; data file: `results/leaderboards/BOOM_leaderboard.csv` (Leaderboard, Raw CSV) |
| Impermanent | Live temporal-generalization benchmark on continuously updated GitHub activity streams. | Tracked with scaled CRPS and MASE in a prequential evaluation loop. | TimeCopilot Impermanent project surface; public leaderboard feed still stabilizing |
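Ingesting one of these leaderboard CSVs amounts to parsing the score column and sorting in the direction the benchmark defines (higher-is-better for Skill Score, lower-is-better for CRPS or average rank). A minimal sketch, using a hypothetical column layout and sample data rather than any provider's real schema:

```python
import csv
import io

# Hypothetical sample; real leaderboard schemas differ per provider.
SAMPLE_CSV = """model,MASE
ModelA,0.92
ModelB,0.81
ModelC,1.05
"""

def ranked_rows(csv_text: str, metric: str, higher_is_better: bool = False):
    """Parse a leaderboard CSV and return (rank, model, score) tuples.

    `higher_is_better` controls the sort direction, since benchmarks
    disagree on whether a larger score means a better model.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda r: float(r[metric]), reverse=higher_is_better)
    return [(i + 1, r["model"], float(r[metric])) for i, r in enumerate(rows)]

print(ranked_rows(SAMPLE_CSV, "MASE"))
# → [(1, 'ModelB', 0.81), (2, 'ModelA', 0.92), (3, 'ModelC', 1.05)]
```

Keeping the sort direction explicit per benchmark avoids silently inverting a ranking when a new metric is added.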

How ranking is computed on TSFM.ai

  1. Fetch upstream benchmark source material from each official provider.
  2. Parse score columns and build ranked rows for each benchmark when a stable machine-readable feed exists.
  3. Resolve leaderboard model names to hosted catalog IDs using normalized name matching and aliases.
  4. Render benchmark detail pages with model deep-links, benchmark-specific metadata, and sitemap coverage.
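Step 3's name resolution can be sketched as follows. The alias map and catalog entries here are hypothetical; the idea is that both leaderboard names and catalog IDs pass through the same normalizer before comparison, with aliases handling names that normalization alone cannot collapse:

```python
import re

def normalize(name: str) -> str:
    """Lowercase and collapse punctuation/whitespace so name variants match."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")

# Hypothetical alias map: alternate normalized names -> canonical normalized key.
ALIASES = {"chronos-bolt-base": "chronos-bolt"}

# Hypothetical hosted catalog, keyed by normalized name.
CATALOG = {normalize(n): n for n in ["Chronos-Bolt", "TimesFM 2.0", "Moirai-MoE"]}

def resolve(leaderboard_name: str):
    """Map a leaderboard model name to a hosted catalog ID, or None."""
    key = normalize(leaderboard_name)
    key = ALIASES.get(key, key)
    return CATALOG.get(key)  # None means the row renders as "Not hosted"

print(resolve("chronos bolt"))       # → Chronos-Bolt
print(resolve("SomeUnknownModel"))   # → None
```

Returning `None` rather than raising keeps unmatched rows in the table, which is what lets non-hosted models still appear with a `Not hosted` marker.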

Operational notes

  • The benchmark hub and benchmark detail routes render server-side so benchmark terms have dedicated indexable URLs.
  • Rows are normalized against TSFM.ai hosted model IDs so each supported model can deep-link into /models/{id}.
  • Not every leaderboard model is hosted; those rows intentionally show `Not hosted`.
  • Benchmark providers can change their schemas over time, so parsing logic should be versioned and monitored for drift.
  • Impermanent is modeled with a deliberate fallback until the public leaderboard feed becomes stable enough to ingest automatically.

Related pages