Evaluation & benchmarks
Understand how benchmark pages are sourced, interpreted, and used to compare models fairly.
Why these benchmark pages exist
Each benchmark answers a different practical question. FEV Bench measures general zero-shot accuracy, GIFT-Eval captures broader probabilistic robustness, BOOM targets observability workloads, and Impermanent extends coverage to live temporal generalization.
Sources
| Benchmark | Scope | Ranking metric | Links |
|---|---|---|---|
| FEV Bench | Broad realistic forecasting benchmark with open leaderboard tables. | Primary ranking on Skill Score (higher is better). | Hugging Face Space: autogluon/fev-bench Data file: |
| GIFT-Eval | General time series benchmark covering diverse datasets and frequencies. | Primary ranking on aggregated average rank (lower is better). | Hugging Face Space: Salesforce/GIFT-Eval Data file: |
| BOOM | Observability and infrastructure telemetry benchmark with production monitoring characteristics. | Primary ranking on CRPS (lower is better). | Hugging Face Space: Datadog/BOOM Data file: |
| Impermanent | Live temporal-generalization benchmark on continuously updated GitHub activity streams. | Tracked with scaled CRPS and MASE in a prequential evaluation loop. | TimeCopilot Impermanent project surface Data file: |
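Impermanent's prequential loop forecasts the next window, scores it against the observations that arrive, then rolls forward. A minimal sketch of that test-then-train loop with a MASE score is below; the function names and the naive stand-in forecaster are illustrative, not the benchmark's actual harness.

```python
import numpy as np

def mase(y_true, y_pred, y_train, season=1):
    """Mean Absolute Scaled Error: forecast MAE scaled by the
    in-sample seasonal-naive MAE (lower is better)."""
    mae = np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))
    naive_mae = np.mean(np.abs(y_train[season:] - y_train[:-season]))
    return mae / naive_mae

def prequential_eval(series, model, horizon=1, min_train=24):
    """Test-then-train loop: forecast the next `horizon` points,
    score them once observed, then extend the training window."""
    scores = []
    for t in range(min_train, len(series) - horizon + 1, horizon):
        train, actual = series[:t], series[t:t + horizon]
        pred = model(train, horizon)
        scores.append(mase(actual, pred, train))
    return float(np.mean(scores))

# naive last-value forecaster as a stand-in model
naive = lambda train, h: np.repeat(train[-1], h)
```

On a steadily trending series the naive forecaster scores exactly 1.0, which is the usual MASE baseline interpretation: below 1 beats seasonal-naive, above 1 does not.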
How ranking is computed on TSFM.ai
1. Fetch upstream benchmark source material from each official provider.
2. Parse score columns and build ranked rows for each benchmark when a stable machine-readable feed exists.
3. Resolve leaderboard model names to hosted catalog IDs using normalized name matching and aliases.
4. Render benchmark detail pages with model deep-links, benchmark-specific metadata, and sitemap coverage.
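The name-resolution step above can be sketched roughly as follows. The alias table, catalog IDs, and function names here are hypothetical stand-ins; the real mappings live in the TSFM.ai catalog.

```python
import re

# hypothetical alias table mapping normalized leaderboard names to catalog IDs
ALIASES = {
    "chronos-bolt-base": "chronos_bolt_base",
}

def normalize(name: str) -> str:
    """Lowercase and collapse punctuation/whitespace so variants like
    'Chronos-Bolt (Base)' and 'chronos bolt base' share one key."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")

def resolve(leaderboard_name, hosted_ids):
    """Map a leaderboard model name to a hosted catalog ID, or None
    when the model is not hosted (rendered as `Not hosted`)."""
    key = normalize(leaderboard_name)
    if key in hosted_ids:
        return key
    return ALIASES.get(key)  # None -> row shows `Not hosted`
```

Resolved IDs feed the `/models/{id}` deep-links; unresolved rows fall through to the `Not hosted` state rather than being dropped.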
Operational notes
- The benchmark hub and benchmark detail routes render server-side so benchmark terms have dedicated indexable URLs.
- Rows are normalized against TSFM.ai hosted model IDs so each supported model can deep-link into /models/{id}.
- Not every leaderboard model is hosted; those rows intentionally show `Not hosted`.
- Benchmark providers can change schemas over time. Keep parsing logic versioned and monitored.
- Impermanent is served from a deliberate fallback until its public leaderboard feed is stable enough to ingest automatically.
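Keeping parsing logic versioned, as the notes recommend, might look like the sketch below: try known schema versions newest-first and fail loudly so monitoring catches an upstream change instead of silently dropping rows. The schemas and parser names are invented for illustration.

```python
import json

# hypothetical parsers, one per known upstream feed schema
def parse_v1(payload):
    return [{"model": r["name"], "score": r["skill"]} for r in payload["rows"]]

def parse_v2(payload):
    return [{"model": r["model_id"], "score": r["metrics"]["skill_score"]}
            for r in payload["leaderboard"]]

PARSERS = [("v2", parse_v2), ("v1", parse_v1)]  # newest schema first

def parse_feed(raw):
    """Try each known schema version in order; raise when none match
    so an upstream schema change pages someone instead of passing quietly."""
    payload = json.loads(raw)
    for version, parser in PARSERS:
        try:
            return version, parser(payload)
        except (KeyError, TypeError):
            continue
    raise ValueError("unrecognized benchmark feed schema")
```

A feed that matches an older schema still parses, and the returned version tag can be logged to track when providers migrate.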