ARFBench (Anomaly Reasoning Framework Benchmark) is a multimodal benchmark from Datadog and CMU that evaluates how well models answer anomaly-reasoning questions about real observability time series. It contains 750 multiple-choice QA pairs built from 142 production time series and 63 real incidents.

How is ARFBench different from BOOM?

BOOM scores forecasting accuracy on observability telemetry (CRPS and MASE). ARFBench scores anomaly reasoning over the same kind of data via multiple-choice QA. The inputs overlap; the evaluation target does not.

Why are human baselines included on the leaderboard?

Domain experts, non-domain experts, and a model-expert oracle (best-of-two over a model and an expert answer) are included as upper-bound references. Domain experts outperform every individual model on the released leaderboard, and the oracle reaches 87.2% accuracy — useful as a ceiling when reasoning about deployable human-plus-model systems.
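As a minimal sketch of how a best-of-two oracle can exceed both of its inputs, the snippet below credits a question when either the model or the expert matches the answer key. This reading of "best-of-two" is an assumption, the function names are hypothetical, and the answers are toy data rather than ARFBench items.

```python
def accuracy(preds, gold):
    """Fraction of questions where the prediction matches the answer key."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def oracle_accuracy(model_preds, expert_preds, gold):
    """Best-of-two (assumed): a question counts if either answer is correct."""
    hits = sum(m == g or e == g for m, e, g in zip(model_preds, expert_preds, gold))
    return hits / len(gold)


# Hypothetical toy answers, not ARFBench data.
gold = ["A", "B", "C", "D", "A"]
model = ["A", "B", "D", "D", "C"]
expert = ["A", "C", "C", "D", "C"]

print(accuracy(model, gold))                 # 0.6
print(accuracy(expert, gold))                # 0.6
print(oracle_accuracy(model, expert, gold))  # 0.8
```

Because the oracle needs the answer key to pick between the two responses, it is a ceiling on what a combined system could reach, not a system that can be deployed as-is.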

Can I run ARFBench on a TSFM hosted by TSFM.ai today?

Not yet. ARFBench requires a multimodal model that accepts plotted time series and answers structured questions. Datadog's Toto-1.0-QA-Experimental is the only specialized checkpoint released so far. We will surface a hosted-model overlay if and when the inference packaging stabilizes.

How often is the leaderboard refreshed?

TSFM.ai refetches the upstream Datadog ARFBench leaderboard CSV every 12 hours via Next.js ISR.

Live · 750 QA pairs · 142 series · Auto-refreshed from the official leaderboard every 12 hours

ARFBench

ARFBench (Anomaly Reasoning Framework Benchmark) evaluates whether models can answer incident-response questions over production observability data: is there an anomaly, when did it start, which channel moved, how large was it, and did this metric lead or lag another one? It is a multimodal reasoning benchmark, not a forecast leaderboard.

What this benchmark answers

Which multimodal models can reason about anomalies, timing, magnitude, and cross-series structure in production telemetry?

Methodology

750 multiple-choice QA pairs built from 142 real Datadog observability time series and 63 production incidents, spanning eight question categories grouped into three difficulty tiers. Models receive a templated question, a metric description, and the plotted time series, and are scored on Accuracy and macro-F1 against expert-reviewed answers.

Model-Expert Oracle leads with Accuracy 87.20, followed by Domain Experts (n=2) (72.70) and Non-domain Experts (n=2) (69.70).

0 of 21 ranked models hosted on TSFM.ai · Higher is better

Rankings

[Chart: Accuracy vs Overall F1. Hosted models are drawn filled; models not hosted are drawn as outlines.]

Full results

#	Model	Accuracy	Overall F1	Type	Tier I	Tier II	Tier III	Hosted
1	Model-Expert Oracle	87.20	82.80	Baseline	96.4	80.3	90.5	—
2	Domain Experts (n=2)	72.70	64.60	Baseline	89.3	67.7	71.4	—
3	Non-domain Experts (n=2)	69.70	60.70	Baseline	80.4	63.2	72.0	—
4	Toto-1.0-QA-Experimental 32B (TSFM-VLM)	63.90	48.90	Post-trained TSFM	84.7	55.6	64.6	—
5	GPT-5	62.70	51.90	VLM	82.0	55.9	62.5	—
6	GPT-5.4	61.30	51.40	VLM	81.1	54.2	61.3	—
7	Gemini 3 Pro	58.10	49.60	VLM	82.9	51.0	56.5	—
8	Qwen3-VL 32B (post-trained)	56.90	46.60	Post-trained TSFM	84.7	50.3	53.8	—
9	GPT-5 (text)	56.40	43.80	LLM	82.6	45.2	57.9	—
10	Claude Opus 4.6	54.80	46.70	VLM	88.3	52.3	45.9	—
11	Qwen3-VL 32B	52.80	45.10	VLM	80.2	46.7	49.2	—
12	Toto-1.0-Qwen3 32B (TSFM-LLM)	48.80	33.90	Post-trained TSFM	82.9	47.4	38.7	—
13	Qwen3 32B	47.90	36.10	LLM	80.9	35.1	48.6	—
14	GPT-4.1	47.90	44.00	VLM	80.2	50.3	34.8	—
15	Claude Sonnet 4.5	47.20	37.90	VLM	83.8	43.5	38.4	—
16	GPT-4o	47.20	42.40	VLM	79.3	49.0	34.8	—
17	Qwen3-VL 8B	45.30	34.70	VLM	80.2	40.8	37.8	—
18	Per-category Frequent Choice	45.10	17.30	Baseline	84.7	30.1	45.6	—
19	ChatTS 8B (TS-LLM)	31.10	22.10	Post-trained TSFM	60.4	26.5	25.5	—
20	Random Choice	24.50	22.50	Baseline	50.0	20.0	20.0	—
21	OpenTSLM 1B (TS-LLM)	0.80	1.20	Post-trained TSFM	0.0	2.0	0.0	—

Why anomaly reasoning is a different benchmark axis

Forecasting benchmarks score whether a model can predict the next window. ARFBench scores whether a model can read a chart, use the surrounding metric context, and answer a structured question about it — the kind of question an on-call engineer asks during an incident. A model can be strong on BOOM and weak here, or vice versa, because the evaluation target is different: calibrated forecasts versus structured reasoning over plotted telemetry.

The three difficulty tiers

Tier I (Presence) asks whether an anomaly exists at all. Tier II (Identification, Start Time, End Time, Magnitude, Categorization) asks the model to characterize one series or grouped channel. Tier III (Correlation, Leading/Lagging Indicator) asks the model to compare anomaly structure across paired series, the closest thing in the benchmark to root-cause reasoning. For most ranked models, Tier III is where the gap to domain experts is largest; the top-ranked models (Toto-1.0-QA-Experimental, GPT-5, GPT-5.4, Gemini 3 Pro) are exceptions that trail by more on Tier II.
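The per-tier gap to domain experts can be checked directly from the results table. The sketch below copies the tier accuracies for four models from the table above (a snapshot that may change when the leaderboard refreshes) and reports which tier has the widest gap for each. The helper name `tier_gaps` is hypothetical.

```python
# Per-tier accuracy copied from the results table (snapshot values).
TIERS = ("Tier I", "Tier II", "Tier III")
DOMAIN_EXPERTS = (89.3, 67.7, 71.4)
MODELS = {
    "Toto-1.0-QA-Experimental 32B": (84.7, 55.6, 64.6),
    "GPT-5": (82.0, 55.9, 62.5),
    "Claude Opus 4.6": (88.3, 52.3, 45.9),
    "GPT-4o": (79.3, 49.0, 34.8),
}


def tier_gaps(model_scores, reference=DOMAIN_EXPERTS):
    """Reference accuracy minus model accuracy, in points, per tier."""
    return {
        tier: round(ref - score, 1)
        for tier, ref, score in zip(TIERS, reference, model_scores)
    }


for name, scores in MODELS.items():
    gaps = tier_gaps(scores)
    widest = max(gaps, key=gaps.get)  # tier with the largest shortfall
    print(f"{name}: {gaps} -> widest gap in {widest}")
```

In this snapshot the two highest-ranked models lose the most ground on Tier II, while Claude Opus 4.6 and GPT-4o lose the most on Tier III.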

Frontier VLMs vs. specialized TSFM-VLMs

Datadog's specialized Toto-1.0-QA-Experimental, a multimodal checkpoint built on Toto and Qwen3-VL, posts the highest model accuracy in the released CSV (63.9%), while frontier VLMs hold the best model F1 scores (GPT-5 at 51.9, GPT-5.4 at 51.4, Gemini 3 Pro at 49.6, against 48.9 for Toto-1.0-QA-Experimental). The takeaway is not that one architecture wins. It is that adding a time-series modality to a VLM is a viable competitive path, and that operator reasoning is far from saturated.
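Which model "leads" depends on the metric used for ranking. As a small illustration, the snippet below takes the five highest-ranked non-baseline rows from the results table (snapshot values) and picks the top entry under each metric. The tuple layout and variable names are chosen here for illustration only.

```python
# (model, accuracy, overall F1) for the top five non-baseline rows of the table.
ROWS = [
    ("Toto-1.0-QA-Experimental 32B (TSFM-VLM)", 63.90, 48.90),
    ("GPT-5", 62.70, 51.90),
    ("GPT-5.4", 61.30, 51.40),
    ("Gemini 3 Pro", 58.10, 49.60),
    ("Qwen3-VL 32B (post-trained)", 56.90, 46.60),
]

best_by_accuracy = max(ROWS, key=lambda row: row[1])[0]
best_by_f1 = max(ROWS, key=lambda row: row[2])[0]

print("Top by accuracy:", best_by_accuracy)
print("Top by F1:", best_by_f1)
```

The two metrics name different leaders in this snapshot, which is why the page reports both columns side by side.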

How to interpret it

—Use ARFBench when your workload is incident response or operator copilots, not raw forecasting.
—Macro-F1 matters because answer choices are imbalanced — a frequent-choice baseline reaches 45% accuracy but only 17% F1.
—Domain experts (72.7% accuracy) still outperform the best evaluated model (63.9%). The model-expert oracle reaches 87.2%, a best-of-two ceiling that suggests human-plus-model setups have the most headroom.
—Tier III (Correlation, Leading/Lagging) is where most models fall off — it is the closest thing to root-cause reasoning.
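The accuracy versus macro-F1 point can be illustrated with a toy example. The sketch below assumes macro-F1 is the unweighted mean of per-choice F1 over the answer choices that appear in the key; how ARFBench averages across choices and categories is not specified here, so treat that as an assumption. The data is hypothetical and the function names are illustrative.

```python
def accuracy(preds, gold):
    """Fraction of questions where the prediction matches the answer key."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def macro_f1(preds, gold):
    """Unweighted mean of per-label F1 over labels present in the key (assumed)."""
    scores = []
    for label in sorted(set(gold)):
        tp = sum(p == label and g == label for p, g in zip(preds, gold))
        fp = sum(p == label and g != label for p, g in zip(preds, gold))
        fn = sum(p != label and g == label for p, g in zip(preds, gold))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced answer key: "A" is correct 6 times out of 10.
gold = ["A"] * 6 + ["B"] * 2 + ["C"] * 2
frequent = ["A"] * len(gold)  # always pick the most frequent choice

print(accuracy(frequent, gold))  # 0.6
print(macro_f1(frequent, gold))  # 0.25
```

Always answering "A" gets 60% of the toy questions right but scores zero F1 on the two choices it never predicts, which drags the macro average down to 0.25. The same effect separates the frequent-choice baseline's accuracy from its F1 on the leaderboard.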


Title	Date
ARFBench: Time-Series Models Have to Answer Questions Now	2026-05-19
BOOM: Datadog's Observability Forecasting Benchmark	2026-02-20
TFRBench: The First Benchmark for Evaluating Reasoning in Forecasting Systems	2026-04-10
Toto: Datadog's Domain-Specific TSFM for Observability	2025-11-15