
FEV Bench: The Zero-Shot Forecasting Benchmark Explained

FEV Bench from AutoGluon evaluates foundation models on pure zero-shot forecasting across 29 diverse datasets. Here's how it works, what it measures, and why it matters.

TSFM.ai Team
February 10, 2026 · 6 min read

If you are evaluating time series foundation models for a production use case, one of the first questions is straightforward: how well does this model forecast data it has never seen before, with no fine-tuning? That is exactly what FEV Bench is designed to answer. Developed by the AutoGluon team at AWS and detailed in their research paper, the Foundation model EValuation Benchmark (FEV Bench) provides a standardized, contamination-aware evaluation of zero-shot forecasting ability across a broad collection of real-world time series.

We surface FEV Bench results directly on our benchmarks page alongside GIFT-Eval and BOOM, so understanding what the numbers represent — and what they do not — is important for making informed model choices.

What FEV Bench Evaluates

FEV Bench tests a model's ability to produce accurate probabilistic forecasts on unseen time series without any training, fine-tuning, or dataset-specific hyperparameter selection. The model receives raw historical values and a target forecast horizon, and must produce a forecast distribution. Nothing else.
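To make the input/output contract concrete, here is a minimal sketch of what "raw history in, forecast distribution out" looks like. The function body is a placeholder stand-in, not any real TSFM; the name `zero_shot_forecast` and the quantile levels are assumptions for illustration.

```python
import numpy as np

def zero_shot_forecast(history: np.ndarray, horizon: int,
                       quantiles=(0.1, 0.5, 0.9)) -> np.ndarray:
    # Placeholder "model": anchors on the last value and widens the
    # uncertainty band with the horizon. A real foundation model would
    # replace this body entirely; only the interface matters here.
    last = history[-1]
    spread = np.std(history) if len(history) > 1 else 1.0
    steps = np.arange(1, horizon + 1)
    return np.stack([last + (q - 0.5) * 2 * spread * np.sqrt(steps)
                     for q in quantiles])

forecast = zero_shot_forecast(np.array([1.0, 2.0, 3.0, 2.0, 1.0]), horizon=4)
print(forecast.shape)  # (3 quantile levels, 4 forecast steps)
```

The key point is what is absent: no training data, no fitted parameters, no dataset-specific configuration, just history and a horizon.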

This is the purest test of a foundation model's generalization: the model must rely entirely on patterns learned during pretraining to handle new domains, frequencies, and statistical properties. It is the scenario most relevant to practitioners who want to deploy a model across diverse data streams without building dataset-specific pipelines.

The Dataset Collection

FEV Bench evaluates on 29 datasets spanning multiple domains and temporal frequencies:

Domains covered include retail demand, energy consumption, transportation traffic, financial indicators, web traffic, weather and nature observations, and macroeconomic series. This breadth is intentional — a model that excels on retail demand data but fails on energy series is not genuinely general-purpose.

Frequencies range from minutely observations through yearly aggregates. This is a critical axis of variation because the statistical structure of time series changes fundamentally with sampling frequency. Minutely data is dominated by noise and short-range autocorrelation. Monthly and yearly data emphasizes trend and long-range seasonality. A strong foundation model must handle both ends of this spectrum. The question of how much historical context to provide is closely related — models that excel at short horizons may need different context lengths than those optimized for long-range forecasts.

Data contamination controls. A persistent problem in TSFM benchmarking is that models are often pretrained on the same public datasets used for evaluation, inflating zero-shot scores; many pretraining corpora draw from the Monash Forecasting Archive and GluonTS dataset collections. FEV Bench addresses this by explicitly tracking which datasets appear in each model's pretraining corpus, and reporting results that account for this overlap. We discuss this issue in depth in our post on benchmarking challenges. This contamination awareness is one of the benchmark's most valuable features, as it pushes the community toward honest evaluation.
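The bookkeeping behind contamination tracking can be sketched simply: maintain a registry of model-to-dataset overlaps and filter the evaluation set per model. The registry contents and names below are entirely hypothetical; FEV Bench's actual overlap records live in its published metadata.

```python
# Hypothetical overlap registry: which public datasets appear in which
# model's pretraining corpus. Model and dataset names are illustrative.
PRETRAIN_OVERLAP = {
    "model_a": {"m4_hourly", "electricity"},
    "model_b": {"electricity"},
}

def clean_datasets(model, all_datasets):
    """Return only the datasets a model has NOT seen during pretraining,
    so its zero-shot score is computed on genuinely unseen data."""
    seen = PRETRAIN_OVERLAP.get(model, set())
    return [d for d in all_datasets if d not in seen]

datasets = ["m4_hourly", "electricity", "traffic", "weather"]
print(clean_datasets("model_a", datasets))  # ['traffic', 'weather']
```

A model with heavy overlap is then scored on a smaller, genuinely held-out subset, which is what keeps cross-model comparisons honest.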

Metrics: Skill Score and Win Rate

FEV Bench uses two complementary metrics to rank models.

Skill Score

The primary ranking metric is the Skill Score, which is derived from MASE (Mean Absolute Scaled Error). MASE normalizes forecast errors against a seasonal naive baseline — the simplest reasonable forecasting method that just repeats the last observed seasonal cycle. A MASE of 1.0 means the model performs exactly as well as the naive baseline. Below 1.0 means it outperforms the baseline; above 1.0 means it does worse.
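The MASE definition above is easy to state in code. This is a minimal sketch of the standard formulation: forecast MAE divided by the in-sample MAE of a seasonal-naive forecast. The `season=1` default (plain last-value naive) is a simplifying assumption; each frequency would use its own seasonal period.

```python
import numpy as np

def mase(actual, forecast, history, season=1):
    """MASE: MAE of the forecast, scaled by the in-sample MAE of a
    seasonal-naive forecast (repeat the value from `season` steps back)."""
    mae_model = np.mean(np.abs(actual - forecast))
    naive_errors = np.abs(history[season:] - history[:-season])
    return mae_model / np.mean(naive_errors)

history = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 14.0])
actual = np.array([13.0, 15.0])

print(mase(actual, np.array([13.0, 15.0]), history))  # perfect forecast -> 0.0
print(mase(actual, np.array([12.0, 14.0]), history))  # 1/1.6 = 0.625, beats naive
```

Values below 1.0, like the 0.625 here, mean the forecast beats the seasonal-naive scaling baseline.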

The Skill Score converts MASE into a composite metric that aggregates performance across all 29 datasets. Higher is better. To put concrete numbers on this: a model with a Skill Score of 0.80 reduces forecast error by 80% relative to the seasonal naive baseline across the benchmark suite, and a Skill Score of 0.50 means a 50% improvement. Differences of 0.02-0.03 between models are often within noise; differences above 0.05 are typically meaningful.
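One plausible way to turn per-dataset MASE values into such a composite is one minus the geometric mean of MASE ratios against the baseline. This is an illustration of the idea, not necessarily FEV Bench's exact aggregation formula.

```python
import numpy as np

def skill_score(model_mase, baseline_mase=None):
    # Illustrative aggregation (an assumption, not necessarily the exact
    # FEV Bench formula): 1 minus the geometric mean of per-dataset MASE
    # ratios. The seasonal naive baseline has MASE 1.0 by definition, so
    # when no baseline is supplied the MASE values are used directly.
    ratios = np.asarray(model_mase, dtype=float)
    if baseline_mase is not None:
        ratios = ratios / np.asarray(baseline_mase, dtype=float)
    return 1.0 - np.exp(np.mean(np.log(ratios)))

# MASE of 0.5 on every dataset -> 50% error reduction -> skill 0.5
print(skill_score([0.5, 0.5, 0.5]))  # 0.5
```

The geometric mean keeps a single easy dataset from dominating the average, which is why ratio-based composites commonly use it.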

Why MASE rather than MAPE or RMSE? MASE has several desirable properties for cross-dataset comparison. It is scale-independent, so it can be meaningfully averaged across datasets with different magnitudes. It is well-defined even when series contain zero values (unlike MAPE, whose division by the actual value is undefined at zeros). And it provides an interpretable baseline comparison, answering the practical question: is this model actually better than the simplest alternative?
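The zero-value failure mode is easy to demonstrate. A small sketch: MAPE blows up on an intermittent series containing a zero, while MASE is unaffected because its denominator comes from the history, not the target window.

```python
import numpy as np

actual = np.array([0.0, 2.0, 4.0])    # intermittent series with a zero
forecast = np.array([1.0, 2.0, 3.0])

with np.errstate(divide="ignore"):
    # MAPE divides each error by the actual value -> infinite at the zero
    mape = np.mean(np.abs((actual - forecast) / actual))
print(np.isfinite(mape))  # False: MAPE is undefined for this series

# MASE scales by the naive forecast's error on the HISTORY, so a zero
# in the evaluation window is harmless.
history = np.array([0.0, 3.0, 1.0, 4.0])
naive_mae = np.mean(np.abs(history[1:] - history[:-1]))
mase = np.mean(np.abs(actual - forecast)) / naive_mae
print(np.isfinite(mase))  # True
```

Retail demand data, where zero-sales days are routine, is exactly where this distinction bites.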

Win Rate

Win Rate captures a different dimension of performance: consistency. It measures how often a model outperforms the majority of other evaluated models on individual datasets. A model with a high Skill Score but low Win Rate is likely dominating a few datasets while underperforming on others. A model with both high Skill Score and high Win Rate is genuinely strong across the board.
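The "outperforms the majority of other models per dataset" idea can be sketched as follows. This uses a median-of-others comparison as an illustrative definition; FEV Bench's exact pairwise scheme may differ.

```python
import numpy as np

def win_rate(mase_table):
    """mase_table: (n_models, n_datasets) array of MASE scores, lower is
    better. A model 'wins' a dataset when it beats the median of the other
    models' scores there; win rate is the fraction of datasets won.
    (Illustrative definition, not necessarily FEV Bench's exact scheme.)"""
    mase_table = np.asarray(mase_table, dtype=float)
    n_models = mase_table.shape[0]
    rates = np.zeros(n_models)
    for m in range(n_models):
        others = np.delete(mase_table, m, axis=0)
        rates[m] = np.mean(mase_table[m] < np.median(others, axis=0))
    return rates

table = [[0.7, 0.9, 0.6],   # model A: strong on datasets 1 and 3
         [0.8, 0.7, 0.9],   # model B: strong only on dataset 2
         [0.9, 0.8, 0.8]]   # model C: never ahead of the pack
print(win_rate(table))      # roughly [0.67, 0.33, 0.0]
```

Note how model A's lead on two of three datasets shows up directly, independent of the margins, which is exactly the consistency signal Win Rate adds on top of the Skill Score.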

This distinction matters in practice. If you are deploying a model across a heterogeneous portfolio of time series — which is the typical production scenario — you want a model that is reliably good, not one that is brilliant on retail data and mediocre on everything else.

Evaluation Protocol

The evaluation protocol is designed for strict reproducibility:

  1. No training or adaptation. Models receive only historical context and must forecast directly. No gradient updates, no in-context learning examples, no dataset-specific preprocessing beyond what the model handles internally.

  2. Standardized preprocessing. All models receive the same input format. Series are not differenced, deseasonalized, or otherwise transformed before being fed to the model. This tests the model's ability to handle raw data, which is how most practitioners will use it.

  3. Fixed horizons per dataset. Each dataset has a predefined forecast horizon that reflects its natural prediction task. The evaluation does not cherry-pick horizons where a particular model happens to excel.

  4. Probabilistic evaluation. Models produce full distributional forecasts, not just point predictions. This is important because probabilistic forecasts provide calibrated uncertainty estimates that are essential for downstream decision-making.
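The four protocol steps above can be combined into a single evaluation loop. This is a sketch, not FEV Bench's actual harness: `model_fn` stands in for any zero-shot TSFM's predict call, the dataset dictionary is a made-up fixture, and the median of a three-quantile forecast is used as the point forecast for MASE.

```python
import numpy as np

def evaluate_model(model_fn, datasets, season=1):
    """For each dataset: feed the raw, untransformed history, forecast the
    dataset's fixed horizon, and score the median track of the returned
    quantile forecast with MASE against the seasonal-naive scaling."""
    scores = {}
    for name, (history, actual) in datasets.items():
        horizon = len(actual)                    # fixed per dataset
        quantiles = model_fn(history, horizon)   # raw series in, distribution out
        point = quantiles[1]                     # median as point forecast
        naive_mae = np.mean(np.abs(history[season:] - history[:-season]))
        scores[name] = np.mean(np.abs(actual - point)) / naive_mae
    return scores

def flat_model(history, horizon):
    # Trivial stand-in model: three flat "quantile" tracks at the last value.
    return np.tile(history[-1], (3, horizon))

data = {"toy": (np.array([1.0, 2.0, 3.0, 4.0]), np.array([5.0, 6.0]))}
print(evaluate_model(flat_model, data))  # {'toy': 1.5} -> worse than naive
```

Swapping in a real model only changes `model_fn`; the protocol itself, including no adaptation and fixed horizons, stays identical for every entrant.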

What FEV Bench Tells You (and What It Does Not)

It tells you which models generalize best out of the box across a wide range of real-world forecasting tasks. If you need to pick a single model for a diverse workload without fine-tuning, FEV Bench rankings are directly relevant.

It does not tell you how a model will perform after fine-tuning on your specific data. Some models that rank lower on zero-shot benchmarks respond exceptionally well to domain-specific adaptation. If you plan to fine-tune, zero-shot benchmarks are a starting point, not the final answer. Models like MOMENT and Moirai were designed with fine-tuning in mind; their zero-shot rankings do not capture their full potential in adapted settings.

It does not capture multivariate performance. FEV Bench focuses on univariate forecasting. If your use case involves forecasting with exogenous covariates or cross-series dependencies, you should also consider benchmarks like GIFT-Eval that evaluate multivariate settings.

It does not reflect production constraints like inference latency, memory footprint, or throughput. A model that ranks first on accuracy but takes 10x longer per forecast may not be the right choice for a real-time streaming application. For infrastructure considerations, see our post on scaling inference and GPU optimization. Lightweight models like Granite TTM may rank lower on FEV Bench but offer dramatically better latency characteristics.

How We Use FEV Bench at TSFM.ai

On our benchmarks page, we pull FEV Bench results directly from the public leaderboard and highlight which ranked models are available through our inference API. When a model you can run through TSFM.ai also ranks well on FEV Bench, you get a concrete signal: this model has been independently validated on diverse data by a rigorous third-party benchmark. You can explore all available models on our models catalog and cross-reference with benchmark rankings.

FEV Bench rankings also feed into our model routing heuristic. When a user submits a forecasting request and we need to select the best model automatically, zero-shot benchmark performance across domains and frequencies is one of the signals the router uses. It is not the only signal — latency requirements, series characteristics, and domain affinity all play a role — but generalization ability as measured by FEV Bench is a strong prior.
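A routing heuristic of this kind can be pictured as a constrained argmax over candidate models. Everything below is hypothetical: the candidate names, scores, latencies, and the single-weight scoring are illustrative stand-ins, not TSFM.ai's actual router.

```python
# Hypothetical candidate table: benchmark skill as a prior, plus a
# latency figure per model. All numbers and names are illustrative.
CANDIDATES = {
    "model_a": {"skill": 0.62, "latency_ms": 120},
    "model_b": {"skill": 0.55, "latency_ms": 15},
}

def route(max_latency_ms):
    """Pick the highest-skill model that satisfies the latency budget.
    A real router would mix in series characteristics and domain affinity."""
    eligible = {name: info for name, info in CANDIDATES.items()
                if info["latency_ms"] <= max_latency_ms}
    return max(eligible, key=lambda name: eligible[name]["skill"])

print(route(max_latency_ms=200))  # model_a: best skill within the budget
print(route(max_latency_ms=50))   # model_b: only model meeting the budget
```

The sketch shows why benchmark skill is "a strong prior" rather than the whole decision: tighten the latency budget and the ranking flips regardless of FEV Bench position.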

For practitioners choosing between models, we recommend starting with FEV Bench to establish a baseline understanding of zero-shot capability, then cross-referencing with domain-specific benchmarks like BOOM (for observability data) or GIFT-Eval (for multivariate and probabilistic evaluation) to get the complete picture.
