How well does a model generalize zero-shot to unseen forecasting series?
Models are evaluated zero-shot across diverse real-world datasets and ranked by Skill Score, with Win Rate showing how consistently they beat competitors across benchmark slices.
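As a rough illustration of how these two numbers can be computed, the sketch below assumes one common convention: Skill Score is one minus the geometric mean of per-task error ratios against a baseline, and Win Rate is the share of (task, competitor) pairs the model wins, with ties counted as half a win. The function names, input layout, and tie rule are assumptions made for this example; the benchmark's own definitions may differ in detail.

```python
import math

def skill_score(model_errors, baseline_errors):
    """1 - geometric mean of per-task error ratios (model / baseline).

    Assumes all errors are strictly positive; 0 means parity with the
    baseline, positive values mean the model beats it on average.
    """
    ratios = [m / b for m, b in zip(model_errors, baseline_errors)]
    log_mean = sum(math.log(r) for r in ratios) / len(ratios)
    return 1.0 - math.exp(log_mean)

def win_rate(model_errors, competitor_errors):
    """Share of (task, competitor) pairs where the model's error is lower.

    competitor_errors is a list of per-competitor error lists, each aligned
    with model_errors. Ties count as half a win (an assumed convention).
    """
    wins, total = 0.0, 0
    for comp in competitor_errors:
        for m, c in zip(model_errors, comp):
            if m < c:
                wins += 1.0
            elif m == c:
                wins += 0.5
            total += 1
    return wins / total

# Hypothetical per-task errors for one model, a baseline, and one other model.
model = [0.5, 2.0]
baseline = [1.0, 2.0]
other = [0.4, 3.0]
print(skill_score(model, baseline))
print(win_rate(model, [baseline, other]))
```

The geometric mean keeps a single task with an unusually large error ratio from dominating the aggregate, which is one reason it is often preferred over an arithmetic mean of ratios.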
Which models stay strong across heterogeneous datasets and probabilistic settings?
Models are scored on grouped benchmark slices and ordered by average rank, with Weighted Quantile Loss providing a secondary read on probabilistic accuracy.
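The sketch below shows one way to compute the two quantities named above. It assumes the usual form of Weighted Quantile Loss (twice the pinball loss, normalized by the sum of absolute actuals and averaged over quantile levels) and an average rank in which tied models share the mean of the ranks they occupy. The data layout and helper names are hypothetical, and the benchmark's exact aggregation may differ.

```python
def quantile_loss(y, q_pred, q):
    """Pinball loss for one observation at quantile level q."""
    diff = y - q_pred
    return max(q * diff, (q - 1.0) * diff)

def weighted_quantile_loss(actuals, quantile_preds):
    """Mean over quantile levels of 2 * pinball loss / sum(|actuals|).

    quantile_preds maps a level q to a list of predictions aligned with
    actuals. Assumes the actuals are not all zero.
    """
    scale = sum(abs(y) for y in actuals)
    per_level = []
    for q, preds in quantile_preds.items():
        loss = sum(2.0 * quantile_loss(y, p, q) for y, p in zip(actuals, preds))
        per_level.append(loss / scale)
    return sum(per_level) / len(per_level)

def average_ranks(scores):
    """Average rank per model across slices; lower score is better.

    scores maps model name -> list of per-slice scores. Tied models share
    the mean of the ranks they occupy (an assumed tie rule).
    """
    models = list(scores)
    n_slices = len(scores[models[0]])
    totals = {m: 0.0 for m in models}
    for i in range(n_slices):
        for m in models:
            better = sum(1 for o in models if scores[o][i] < scores[m][i])
            tied = sum(1 for o in models if scores[o][i] == scores[m][i])  # includes m
            totals[m] += better + (tied + 1) / 2.0
    return {m: totals[m] / n_slices for m in models}

# Hypothetical scores for three models on two slices.
print(average_ranks({"a": [1.0, 2.0], "b": [2.0, 2.0], "c": [3.0, 1.0]}))
```

Ranking per slice before averaging makes the result insensitive to the scale of each slice's metric, at the cost of discarding how large the gaps between models are.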
How do models behave on observability telemetry instead of academic datasets?
Models are ranked on real observability time series using CRPS (Continuous Ranked Probability Score) for probabilistic quality and MASE (Mean Absolute Scaled Error) for point accuracy, aggregated across production telemetry collected from infrastructure workloads.
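For readers unfamiliar with the two metrics, here is a minimal sketch of each. MASE is computed as the forecast's mean absolute error divided by the in-sample error of a seasonal naive forecast, and CRPS is approximated from forecast samples with the standard energy form E|X - y| - 0.5 E|X - X'|. The function names, the sample-based estimator, and the default season length are assumptions for illustration, not a description of the benchmark's implementation.

```python
def mase(actuals, forecasts, history, season=1):
    """Mean Absolute Scaled Error.

    Scales the forecast MAE by the in-sample MAE of a seasonal naive
    forecast on the history. Assumes len(history) > season and that the
    history is not constant at that lag.
    """
    mae = sum(abs(a - f) for a, f in zip(actuals, forecasts)) / len(actuals)
    naive_errors = [
        abs(history[t] - history[t - season]) for t in range(season, len(history))
    ]
    scale = sum(naive_errors) / len(naive_errors)
    return mae / scale

def crps_from_samples(samples, y):
    """Sample-based CRPS estimate for one observation y.

    Uses E|X - y| - 0.5 * E|X - X'| over all sample pairs. With a single
    sample this reduces to the absolute error.
    """
    n = len(samples)
    term_obs = sum(abs(s - y) for s in samples) / n
    term_spread = sum(abs(a - b) for a in samples for b in samples) / (2.0 * n * n)
    return term_obs - term_spread

# Hypothetical history, point forecasts, and forecast samples.
history = [1.0, 2.0, 4.0, 3.0]
print(mase([5.0, 6.0], [4.0, 8.0], history))
print(crps_from_samples([1.0, 3.0], 2.0))
```

Both metrics are lower-is-better. A MASE below 1 means the forecast beats the naive baseline on that series, which makes values comparable across series with very different magnitudes.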
Does model performance hold up as real time passes and the data distribution shifts?
The benchmark uses a prequential protocol: models forecast before outcomes exist, scores accumulate over time, and rankings reflect sustained performance under live temporal change rather than one-off wins on a frozen split.
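The prequential ("predict, then observe") loop can be sketched as follows. Everything here is a toy stand-in: the two forecasters, the `predict`/`update` interface, and the absolute-error loss are hypothetical and serve only to show the ordering of steps, namely that each forecast is committed before its outcome is revealed and that scores accumulate over time.

```python
class LastValue:
    """Toy forecaster: predicts the most recently observed value."""
    def __init__(self, initial=0.0):
        self.last = initial
    def predict(self):
        return self.last
    def update(self, y):
        self.last = y

class RunningMean:
    """Toy forecaster: predicts the mean of everything observed so far."""
    def __init__(self):
        self.total, self.count = 0.0, 0
    def predict(self):
        return self.total / self.count if self.count else 0.0
    def update(self, y):
        self.total += y
        self.count += 1

def prequential_scores(stream, forecasters, loss):
    """Return the mean loss per forecaster over a time-ordered stream."""
    totals = {name: 0.0 for name in forecasters}
    steps = 0
    for y in stream:
        # 1) Every model commits to a forecast before the outcome is seen.
        preds = {name: f.predict() for name, f in forecasters.items()}
        # 2) The outcome arrives and each committed forecast is scored.
        for name, p in preds.items():
            totals[name] += loss(y, p)
        # 3) Only now is the outcome added to each model's context.
        for f in forecasters.values():
            f.update(y)
        steps += 1
    return {name: totals[name] / max(steps, 1) for name in totals}

scores = prequential_scores(
    [1.0, 2.0, 3.0],
    {"last": LastValue(), "mean": RunningMean()},
    lambda y, p: abs(y - p),
)
print(scores)
```

Because step 1 always precedes step 2, no forecast can leak information from its own target, which is the property that separates this protocol from scoring on a frozen split.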
Which multimodal models can reason about anomalies, timing, magnitude, and cross-series structure in production telemetry?
The benchmark comprises 750 multiple-choice QA pairs built from 142 real Datadog observability time series and 63 production incidents, spanning eight question categories grouped into three difficulty tiers. Models receive a templated question, a metric description, and the plotted time series, and are scored on Accuracy and macro-F1 against expert-reviewed answers.
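As a minimal sketch of the two scores, the code below computes Accuracy as the share of exact matches and macro-F1 as the unweighted mean of per-label F1 values. It assumes the labels are the answer options and that macro-F1 averages over every label appearing in either the reference answers or the predictions; whether the benchmark averages over answer options or over question categories is not stated here, so treat that choice as an assumption.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that exactly match the reference answer."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 over labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_values = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); guard against an empty denominator.
        f1_values.append(2 * tp / denom if denom else 0.0)
    return sum(f1_values) / len(f1_values)

# Hypothetical reference answers and model choices for four questions.
y_true = ["A", "A", "B", "C"]
y_pred = ["A", "B", "B", "C"]
print(accuracy(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

Reporting macro-F1 alongside Accuracy guards against a model that scores well simply by favoring the most frequent answer option, since every label contributes equally to the average.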