BOOM
BOOM answers a different question from the general-purpose leaderboards: how do TSFMs behave on messy production monitoring data with spikes, heavy tails, and regime shifts? For observability workloads, this is the benchmark that actually looks like the job.
What this benchmark answers
How do models behave on observability telemetry instead of academic datasets?
Methodology
Models are ranked on real observability time series using CRPS for probabilistic quality and MASE for point accuracy, aggregated across 2,807 production telemetry series collected from real infrastructure workloads.
Toto-Open-Base-1.0 ranks first with CRPS 0.38, followed by moirai_1.1_base (0.43) and moirai_1.1_large (0.44).
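BOOM's exact scoring pipeline is not shown here, but CRPS for a sample-based forecast is commonly estimated as E|X − y| − ½·E|X − X′|, where X, X′ are independent draws from the predictive distribution and y is the observed value. A minimal NumPy sketch (function name and array shapes are illustrative, not BOOM's API):

```python
import numpy as np

def crps_samples(samples, y):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|.

    samples: 1-D array of draws from the predictive distribution.
    y: the observed scalar value. Lower is better; 0 means a
    point mass exactly on the observation.
    """
    samples = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(samples - y))
    # Pairwise absolute differences between all sample pairs.
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term1 - term2
```

A leaderboard score like Toto's 0.38 would then be an aggregate (e.g. a mean over series and horizons) of per-point values like this one, computed on normalized data.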
8 of 15 ranked models hosted on TSFM.ai · Lower is better
Full results
| # | Model | CRPS |
|---|---|---|
| 1 | Toto-Open-Base-1.0 | 0.38 |
| 2 | moirai_1.1_base | 0.43 |
| 3 | moirai_1.1_large | 0.44 |
| 4 | | 0.44 |
| 5 | | 0.45 |
| 6 | | 0.45 |
| 7 | | 0.46 |
| 8 | timer | 0.64 |
| 9 | | 0.64 |
| 10 | time-moe-50M | 0.65 |
| 11 | visionts | 0.67 |
| 12 | autoarima | 0.74 |
| 13 | seasonalnaive | 1.00 |
| 14 | autotheta | 1.02 |
| 15 | autoets | 1.98 |
Why observability data is different
Production monitoring series have characteristics that academic benchmarks rarely capture: heavy-tailed distributions from traffic spikes, regime shifts when deployments change system behavior, irregular sampling from agent collection windows, and strong multivariate coupling between infrastructure metrics. BOOM uses 2,807 real series from Datadog across infrastructure, networking, database, security, and application domains to stress-test models on data that looks like what SRE teams actually see.
CRPS vs MASE in BOOM
CRPS (Continuous Ranked Probability Score) evaluates the full predictive distribution — it rewards models that produce well-calibrated uncertainty, not just sharp point estimates. MASE complements this with a scale-free point accuracy measure. If a model ranks well on CRPS but poorly on MASE, it is producing good uncertainty bands but inaccurate central forecasts; the reverse means sharp predictions with unreliable confidence intervals.
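MASE normalizes forecast error by the in-sample error of a naive forecast, so it is comparable across series of different scales and a score near 1.0 means "no better than naive." A minimal sketch, assuming a 1-D training history and seasonal period `m` (names and signature are illustrative):

```python
import numpy as np

def mase(y_true, y_pred, y_train, m=1):
    """Mean Absolute Scaled Error.

    Scales the forecast MAE by the mean absolute error of a
    seasonal-naive forecast (lag m) on the training history.
    Values below 1.0 beat the naive baseline; lower is better.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    y_train = np.asarray(y_train, dtype=float)

    mae = np.mean(np.abs(y_true - y_pred))
    # In-sample seasonal-naive error used as the scale.
    scale = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return mae / scale
```

Because the scale is fixed by the training history, a model's CRPS and MASE can rank it differently, which is exactly the calibration-versus-sharpness distinction described above.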
How to interpret it
- BOOM is the benchmark to prioritize when your inputs look like production telemetry.
- CRPS rewards calibrated uncertainty, not just sharp point estimates.
- Large ranking changes between BOOM and general-purpose benchmarks are a useful routing signal.
Frequently asked questions
- What is the BOOM benchmark?
- BOOM is a forecasting benchmark by Datadog that evaluates time series foundation models on 2,807 real observability time series from production infrastructure. It uses CRPS and MASE as primary and secondary metrics.
- Why does BOOM rank models differently from FEV Bench?
- BOOM uses production telemetry data with heavy tails, regime shifts, and irregular sampling — characteristics rarely found in academic datasets. Models optimized for clean series often struggle on this workload, leading to significant ranking differences.
- What does CRPS measure?
- CRPS (Continuous Ranked Probability Score) evaluates the full predictive distribution, rewarding models that produce calibrated uncertainty estimates. Lower CRPS is better.
- Is BOOM relevant if I am not in observability?
- If your data has similar characteristics to production telemetry — spiky, heavy-tailed, with regime changes — BOOM rankings can be more informative than general-purpose leaderboards even outside of observability use cases.
Related reading
Compare with other TSFM benchmarks:
- Which models stay strong across heterogeneous datasets and probabilistic settings?
- Does model performance hold up as real time passes and the data distribution shifts?