FEV Bench: Zero-shot generalization
GIFT-Eval: Probabilistic and multiset robustness
BOOM: Infrastructure and observability forecasting
Impermanent: Temporal robustness and concept drift
ARFBench: Anomaly reasoning over observability telemetry