Live · 750 sources · 142 series · Auto-refreshed from the official leaderboard every 12 hours

ARFBench

ARFBench (Anomaly Reasoning Framework Benchmark) evaluates whether models can answer incident-response questions over production observability data: is there an anomaly, when did it start, which channel moved, how large was it, and did this metric lead or lag another one? It is a multimodal reasoning benchmark, not a forecast leaderboard.

What this benchmark answers

Which multimodal models can reason about anomalies, timing, magnitude, and cross-series structure in production telemetry?

Methodology

ARFBench comprises 750 multiple-choice QA pairs built from 142 real Datadog observability time series and 63 production incidents, spanning eight question categories grouped into three difficulty tiers. Models receive a templated question, a metric description, and the plotted time series, and are scored on Accuracy and macro-F1 against expert-reviewed answers.
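To make the two scoring metrics concrete, here is a minimal sketch of accuracy and macro-F1 over multiple-choice answers. It is an illustration, not the benchmark's official scorer: the function names and the toy answers are hypothetical, and whether ARFBench averages F1 over all choices at once or per category first is an assumption not confirmed by this page.

```python
def accuracy(y_true, y_pred):
    """Fraction of questions whose predicted choice matches the gold choice."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-choice F1 over every choice seen in gold or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A choice that is never predicted and never correct contributes 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy gold answers and predictions (made up for illustration).
gold = ["A", "A", "A", "B", "C", "D"]
pred = ["A", "A", "B", "B", "C", "A"]

print(f"accuracy = {accuracy(gold, pred):.3f}")
print(f"macro-F1 = {macro_f1(gold, pred):.3f}")
```

Because macro-F1 weights every answer choice equally, a rare choice that a model never gets right pulls the score down far more than it pulls down accuracy.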

Model-Expert Oracle leads with Accuracy 87.20, followed by Domain Experts (n=2) (72.70) and Non-domain Experts (n=2) (69.70).
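The Model-Expert Oracle is described in the FAQ as best-of-two over a model answer and an expert answer. One plausible reading, sketched below, is that a question counts as correct when either answer matches the gold choice. That reading is an assumption, and the function name and toy answers are hypothetical.

```python
def oracle_accuracy(gold, model_answers, expert_answers):
    """Score a question as correct if the model OR the expert chose the gold answer."""
    hits = sum(
        m == g or e == g
        for g, m, e in zip(gold, model_answers, expert_answers)
    )
    return hits / len(gold)


def accuracy(gold, answers):
    """Plain accuracy for a single answerer."""
    return sum(a == g for g, a in zip(gold, answers)) / len(gold)


# Toy answers (made up for illustration).
gold = ["A", "B", "C", "D", "A"]
model = ["A", "B", "D", "A", "B"]
expert = ["A", "C", "C", "A", "A"]

print("model :", accuracy(gold, model))
print("expert:", accuracy(gold, expert))
print("oracle:", oracle_accuracy(gold, model, expert))
```

Under this reading the oracle can never score below either answerer alone, which is why it works as a ceiling rather than as a system you could deploy directly.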

0 of 21 ranked models hosted on TSFM.ai · Higher is better

Rankings

[Chart: Accuracy vs Overall F1. Filled markers: hosted; outlined markers: not hosted.]

Full results

#   Model                                     Accuracy
1   Model-Expert Oracle                       87.20
2   Domain Experts (n=2)                      72.70
3   Non-domain Experts (n=2)                  69.70
4   Toto-1.0-QA-Experimental 32B (TSFM-VLM)   63.90
5   GPT-5                                     62.70
6   GPT-5.4                                   61.30
7   Gemini 3 Pro                              58.10
8   Qwen3-VL 32B (post-trained)               56.90
9   GPT-5 (text)                              56.40
10  Claude Opus 4.6                           54.80
11  Qwen3-VL 32B                              52.80
12  Toto-1.0-Qwen3 32B (TSFM-LLM)             48.80
13  Qwen3 32B                                 47.90
14  GPT-4.1                                   47.90
15  Claude Sonnet 4.5                         47.20
16  GPT-4o                                    47.20
17  Qwen3-VL 8B                               45.30
18  Per-category Frequent Choice              45.10
19  ChatTS 8B (TS-LLM)                        31.10
20  Random Choice                             24.50
21  OpenTSLM 1B (TS-LLM)                      0.80

Why anomaly reasoning is a different benchmark axis

Forecasting benchmarks score whether a model can predict the next window. ARFBench scores whether a model can read a chart, use the surrounding metric context, and answer a structured question about it — the kind of question an on-call engineer asks during an incident. A model can be strong on BOOM and weak here, or vice versa, because the evaluation target is different: calibrated forecasts versus structured reasoning over plotted telemetry.

The three difficulty tiers

Tier I (Presence) asks whether an anomaly exists at all. Tier II (Identification, Start Time, End Time, Magnitude, Categorization) asks the model to characterize one series or grouped channel. Tier III (Correlation, Leading/Lagging Indicator) asks the model to compare anomaly structure across paired series — the closest thing in the benchmark to root-cause reasoning. Tier III is where the gap between models and human experts is largest.
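The tier grouping above can be written down as a small lookup table, which makes it easy to break a model's accuracy out by tier. This is a sketch under assumptions: the category names come from the paragraph above, but the exact label strings in the released data, the record layout, and the helper name are hypothetical.

```python
from collections import defaultdict

# Eight question categories mapped to their three difficulty tiers.
# The label strings are assumed; the released data may spell them differently.
TIER_OF = {
    "Presence": "I",
    "Identification": "II",
    "Start Time": "II",
    "End Time": "II",
    "Magnitude": "II",
    "Categorization": "II",
    "Correlation": "III",
    "Leading/Lagging Indicator": "III",
}


def accuracy_by_tier(records):
    """records: iterable of (category, gold_answer, predicted_answer) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, gold, pred in records:
        tier = TIER_OF[category]
        total[tier] += 1
        correct[tier] += int(gold == pred)
    return {tier: correct[tier] / total[tier] for tier in sorted(total)}


# Toy records (made up for illustration).
records = [
    ("Presence", "A", "A"),
    ("Magnitude", "B", "B"),
    ("Start Time", "C", "A"),
    ("Correlation", "A", "B"),
    ("Leading/Lagging Indicator", "B", "B"),
]
print(accuracy_by_tier(records))
```

Reporting per-tier numbers alongside the overall score shows whether a model's accuracy comes mostly from the easier Presence questions or holds up on cross-series comparisons.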

Frontier VLMs vs. specialized TSFM-VLMs

Frontier VLMs (GPT-5, GPT-5.4, Gemini 3 Pro) take most of the top model slots, but Datadog's specialized Toto-1.0-QA-Experimental, a multimodal checkpoint built on Toto and Qwen3-VL, posts the highest model accuracy in the released CSV (63.90 versus 62.70 for GPT-5). The takeaway is not that one architecture wins. It is that adding a time-series modality to a VLM is a viable competitive path, and that operator reasoning is far from saturated: the best model still trails domain experts (72.70) by almost nine points.

How to interpret it

  • Use ARFBench when your workload is incident response or operator copilots, not raw forecasting.
  • Macro-F1 matters because answer choices are imbalanced — a frequent-choice baseline reaches 45% accuracy but only 17% F1.
  • Domain experts (72.7% accuracy) still outperform the best evaluated model (63.9%). The model-expert oracle reaches 87.2%, an upper bound that suggests pairing a human with a model can beat either one alone.
  • Tier III (Correlation, Leading/Lagging) is where most models fall off — it is the closest thing to root-cause reasoning.
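The gap between accuracy and macro-F1 for a frequent-choice baseline can be reproduced on toy data. The sketch below assumes the baseline answers every question with its category's most common gold choice and that F1 is pooled over all choices. Both are assumptions about how the leaderboard row is computed, and the function names and data are hypothetical.

```python
from collections import Counter, defaultdict


def frequent_choice_predictions(records):
    """records: list of (category, gold_answer). Predict each category's most common gold answer."""
    counts = defaultdict(Counter)
    for category, gold in records:
        counts[category][gold] += 1
    top = {cat: ctr.most_common(1)[0][0] for cat, ctr in counts.items()}
    return [top[category] for category, _ in records]


def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-choice F1; choices never predicted correctly score 0."""
    scores = []
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy imbalanced data: one dominant choice per category (made up for illustration).
records = (
    [("Presence", "A")] * 3 + [("Presence", "B")]
    + [("Magnitude", "C")] * 3 + [("Magnitude", "D")]
)
gold = [g for _, g in records]
pred = frequent_choice_predictions(records)

print(f"accuracy = {accuracy(gold, pred):.2f}")
print(f"macro-F1 = {macro_f1(gold, pred):.2f}")
```

On this toy set the baseline reaches 0.75 accuracy but only about 0.43 macro-F1, because the two minority choices it never predicts each contribute an F1 of zero.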

Frequently asked questions

What is ARFBench?

ARFBench (Anomaly Reasoning Framework Benchmark) is a multimodal benchmark from Datadog and CMU that evaluates how well models answer anomaly-reasoning questions about real observability time series. It contains 750 multiple-choice QA pairs built from 142 production time series and 63 real incidents.

How is ARFBench different from BOOM?

BOOM scores forecasting accuracy on observability telemetry (CRPS and MASE). ARFBench scores anomaly reasoning over the same kind of data via multiple-choice QA. The inputs overlap; the evaluation target does not.

Why are human baselines included on the leaderboard?

Domain experts, non-domain experts, and a model-expert oracle (best-of-two over a model and an expert answer) are included as upper-bound references. Domain experts outperform every individual model on the released leaderboard, and the oracle reaches 87.2% accuracy, which is useful as a ceiling when reasoning about deployable human-plus-model systems.

Can I run ARFBench on a TSFM hosted by TSFM.ai today?

Not yet. ARFBench requires a multimodal model that accepts plotted time series and answers structured questions. Datadog's Toto-1.0-QA-Experimental is the only specialized checkpoint released so far. We will surface a hosted-model overlay if and when the inference packaging stabilizes.

How often is the leaderboard refreshed?

TSFM.ai refetches the upstream Datadog ARFBench leaderboard CSV every 12 hours via Next.js ISR.

Compare with other TSFM benchmarks

FEV Bench

How well does a model generalize zero-shot to unseen forecasting series?

GIFT-Eval

Which models stay strong across heterogeneous datasets and probabilistic settings?

BOOM

How do models behave on observability telemetry instead of academic datasets?

Impermanent

Does model performance hold up as real time passes and the data distribution shifts?
