FEV Bench evaluates how well foundation models forecast unseen time series without any fine-tuning. It tests models on a diverse collection of real-world datasets spanning the retail, energy, finance, traffic, and nature domains at multiple frequencies (minutely through yearly). Models are ranked by Skill Score, a composite metric that measures accuracy relative to a seasonal naive baseline across all datasets. Win Rate tracks how often a model beats the majority of competitors on individual datasets, capturing consistency alongside average performance.
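The Win Rate idea can be sketched as follows. This is a hypothetical illustration of "beats the majority of competitors per dataset", not the benchmark's exact tie-breaking or pairing rule; the `scores` structure and function name are assumptions for the example.

```python
def win_rate(scores):
    """scores: dict mapping model name -> list of per-dataset errors
    (lower is better; all lists cover the same datasets in order).

    A model 'wins' a dataset when it outperforms more than half of the
    other models on it; its win rate is the fraction of datasets won.
    Sketch only -- the benchmark's exact rule may differ."""
    models = list(scores)
    n_datasets = len(next(iter(scores.values())))
    rates = {}
    for m in models:
        wins = 0
        for d in range(n_datasets):
            # Count competitors strictly beaten on this dataset.
            beaten = sum(scores[m][d] < scores[o][d] for o in models if o != m)
            if beaten > (len(models) - 1) / 2:
                wins += 1
        rates[m] = wins / n_datasets
    return rates

# Example: model "A" has the lowest error on both datasets.
rates = win_rate({"A": [0.8, 0.9], "B": [1.0, 1.1], "C": [1.2, 1.0]})
```

Using strict inequality means ties never count as wins, which keeps the metric conservative for models with identical scores.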
Methodology
Each model receives the raw historical series and must produce probabilistic forecasts at the specified horizon with no training, fine-tuning, or dataset-specific hyperparameter selection. Results are aggregated using MASE (Mean Absolute Scaled Error) normalized against a seasonal naive baseline, then converted into a composite Skill Score.
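A minimal sketch of the two steps above, assuming MASE is scaled by the in-sample error of a one-season-ahead naive forecast and that the composite Skill Score is one minus a geometric mean of per-dataset errors relative to the baseline (the aggregation details are an assumption for illustration, not the benchmark's published formula):

```python
import numpy as np

def mase(y_true, y_pred, y_train, season=1):
    """Mean Absolute Scaled Error: forecast MAE divided by the in-sample
    MAE of a seasonal naive forecast (each value repeated one season later)."""
    naive_errors = np.abs(np.asarray(y_train)[season:] - np.asarray(y_train)[:-season])
    scale = naive_errors.mean()
    return np.abs(np.asarray(y_true) - np.asarray(y_pred)).mean() / scale

def skill_score(relative_errors):
    """Composite score from per-dataset errors relative to the baseline:
    1 - geometric mean (assumed aggregation). A relative error of 1.0
    means parity with the seasonal naive baseline, giving a score of 0."""
    rel = np.asarray(relative_errors, dtype=float)
    return 1.0 - np.exp(np.log(rel).mean())

# Example: a perfect forecast has MASE 0; errors half the baseline's
# across datasets give a skill score of 0.5.
y_train = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0]
perfect = mase([13.0, 15.0], [13.0, 15.0], y_train)
```

A geometric mean is a natural choice here because relative errors are ratios: halving the error on one dataset and doubling it on another cancels out instead of averaging away.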