BOOM
BOOM answers a different question from the general-purpose leaderboards: how do TSFMs behave on messy production monitoring data with spikes, heavy tails, and regime shifts? For observability workloads, this is the benchmark that actually looks like the job.
What this benchmark answers
How do models behave on observability telemetry instead of academic datasets?
Methodology
Models are ranked on real observability time series using CRPS for probabilistic quality and MASE for point accuracy, aggregated across 2,807 production telemetry series collected from real infrastructure workloads.
Toto-Open-Base-1.0 ranks first with CRPS 0.38, followed by moirai_1.1_base (0.43) and moirai_1.1_large (0.44).
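BOOM's exact scoring pipeline is not shown here, but CRPS for a sample-based forecast is commonly estimated as E|X − y| − ½·E|X − X′|, where X, X′ are independent draws from the predictive distribution and y is the observed value. A minimal NumPy sketch (function name and array shapes are illustrative, not BOOM's API):

```python
import numpy as np

def crps_samples(samples, y):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|.

    samples: 1-D array of draws from the predictive distribution.
    y: the observed scalar value. Lower is better; 0 means a
    point mass exactly on the observation.
    """
    samples = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(samples - y))
    # Pairwise absolute differences between all sample pairs.
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term1 - term2
```

A leaderboard score like Toto's 0.38 would then be an aggregate (e.g. a mean over series and horizons) of per-point values like this one, computed on normalized data.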
8 of 15 ranked models hosted on TSFM.ai · Lower is better
Full results
| # | Model | CRPS |
|---|---|---|
| 1 | Toto-Open-Base-1.0 | 0.38 |
| 2 | moirai_1.1_base | 0.43 |
| 3 | moirai_1.1_large | 0.44 |
| 4 | | 0.44 |
| 5 | | 0.45 |
| 6 | | 0.45 |
| 7 | | 0.46 |
| 8 | timer | 0.64 |
| 9 | | 0.64 |
| 10 | time-moe-50M | 0.65 |
| 11 | visionts | 0.67 |
| 12 | autoarima | 0.74 |
| 13 | seasonalnaive | 1.00 |
| 14 | autotheta | 1.02 |
| 15 | autoets | 1.98 |
Why observability data is different
Production monitoring series have characteristics that academic benchmarks rarely capture: heavy-tailed distributions from traffic spikes, regime shifts when deployments change system behavior, irregular sampling from agent collection windows, and strong multivariate coupling between infrastructure metrics. BOOM uses 2,807 real series from Datadog across infrastructure, networking, database, security, and application domains to stress-test models on data that looks like what SRE teams actually see.
CRPS vs MASE in BOOM
CRPS (Continuous Ranked Probability Score) evaluates the full predictive distribution — it rewards models that produce well-calibrated uncertainty, not just sharp point estimates. MASE complements this with a scale-free point accuracy measure. If a model ranks well on CRPS but poorly on MASE, it is producing good uncertainty bands but inaccurate central forecasts; the reverse means sharp predictions with unreliable confidence intervals.
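MASE normalizes forecast error by the in-sample error of a naive forecast, so it is comparable across series of different scales and a score near 1.0 means "no better than naive." A minimal sketch, assuming a 1-D training history and seasonal period `m` (names and signature are illustrative):

```python
import numpy as np

def mase(y_true, y_pred, y_train, m=1):
    """Mean Absolute Scaled Error.

    Scales the forecast MAE by the mean absolute error of a
    seasonal-naive forecast (lag m) on the training history.
    Values below 1.0 beat the naive baseline; lower is better.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    y_train = np.asarray(y_train, dtype=float)

    mae = np.mean(np.abs(y_true - y_pred))
    # In-sample seasonal-naive error used as the scale.
    scale = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return mae / scale
```

Because the scale is fixed by the training history, a model's CRPS and MASE can rank it differently, which is exactly the calibration-versus-sharpness distinction described above.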
How to interpret it
- BOOM is the benchmark to prioritize when your inputs look like production telemetry.
- CRPS rewards calibrated uncertainty, not just sharp point estimates.
- Large ranking changes between BOOM and general-purpose benchmarks are a useful routing signal.
Frequently asked questions
- What is the BOOM benchmark?
- BOOM is a forecasting benchmark by Datadog that evaluates time series foundation models on 2,807 real observability time series from production infrastructure. It uses CRPS and MASE as primary and secondary metrics.
- Why does BOOM rank models differently from FEV Bench?
- BOOM uses production telemetry data with heavy tails, regime shifts, and irregular sampling — characteristics rarely found in academic datasets. Models optimized for clean series often struggle on this workload, leading to significant ranking differences.
- What does CRPS measure?
- CRPS (Continuous Ranked Probability Score) evaluates the full predictive distribution, rewarding models that produce calibrated uncertainty estimates. Lower CRPS is better.
- Is BOOM relevant if I am not in observability?
- If your data has similar characteristics to production telemetry — spiky, heavy-tailed, with regime changes — BOOM rankings can be more informative than general-purpose leaderboards even outside of observability use cases.
Related reading
Compare with other TSFM benchmarks:
- Which models stay strong across heterogeneous datasets and probabilistic settings?
- Does model performance hold up as real time passes and the data distribution shifts?