
The Challenges of Benchmarking TSFMs

Benchmarking time series foundation models is harder than it looks. Here's why results often conflict and what the field is doing about it.

TSFM.ai Team
November 5, 2024 · 5 min read


If you have read papers on time series foundation models, you have probably noticed something frustrating: every model claims state-of-the-art results, yet the rankings shuffle depending on which paper you are reading. Chronos leads on one benchmark suite, Moirai on another, TimesFM on a third. This is not necessarily because anyone is being dishonest. It is because benchmarking time series foundation models is genuinely difficult, and the field has not yet converged on standardized evaluation practices.

Understanding why this happens is important for practitioners who need to choose a model for their specific problem, and for researchers working to advance the field on solid footing.

Issue 1: Data Contamination

Foundation models are pretrained on large, diverse corpora of time series data. Many of these corpora draw from publicly available datasets: the Monash Forecasting Archive, GluonTS datasets, the UCI Machine Learning Repository, Kaggle competition data, and similar sources. The problem is that these are also the most commonly used evaluation benchmarks.

When a model is pretrained on data that overlaps with the test set, zero-shot evaluation results become unreliable. The model may have effectively memorized patterns from those specific series during pretraining. This is analogous to the data contamination problem in large language models, where benchmark performance may reflect exposure rather than generalization.

Some papers are transparent about this overlap. The Chronos paper, for instance, explicitly evaluates on both "in-domain" datasets (present in pretraining) and "zero-shot" datasets (held out from pretraining) and reports results separately. But not all papers make this distinction, and even when they do, readers often cite the aggregate numbers without the nuance.

Issue 2: Inconsistent Evaluation Protocols

There is no universal standard for how to evaluate a time series forecast. Different papers use different metrics, different forecast horizons, different train/test splitting strategies, and different preprocessing pipelines. The result is that even when two papers evaluate on the same dataset, their numbers may not be directly comparable.

Metrics. Some papers report MASE (Mean Absolute Scaled Error), which normalizes by the naive seasonal forecast. Others use MAPE (Mean Absolute Percentage Error), which breaks down on series with values near zero. Probabilistic models are evaluated with CRPS (Continuous Ranked Probability Score) or WQL (Weighted Quantile Loss), but the specific quantile levels vary. A model can look excellent on CRPS and mediocre on MASE for the same dataset.
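To make the metric sensitivity concrete, here is a minimal sketch of MASE and MAPE in plain Python (toy numbers, not from any real benchmark). Note how MAPE explodes when an actual value sits near zero, while MASE, which scales by the in-sample seasonal-naive error, stays interpretable:

```python
# Sketch of two common point metrics; y_train supplies the
# seasonal-naive scale used by MASE (season length m).

def mase(y_true, y_pred, y_train, m=1):
    """Mean Absolute Scaled Error: forecast errors divided by the
    mean in-sample seasonal-naive error."""
    naive_errors = [abs(y_train[t] - y_train[t - m]) for t in range(m, len(y_train))]
    scale = sum(naive_errors) / len(naive_errors)
    return sum(abs(a - f) for a, f in zip(y_true, y_pred)) / (len(y_true) * scale)

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error: blows up on values near zero."""
    return 100 * sum(abs((a - f) / a) for a, f in zip(y_true, y_pred)) / len(y_true)

y_train = [10, 12, 11, 13, 12, 14]
print(round(mase([13, 15], [12, 14], y_train, m=1), 3))  # 0.625

# One near-zero actual is enough to dominate MAPE entirely:
print(mape([0.01, 15], [1.0, 14]) > 1000)  # True
```

The same pair of forecasts can therefore look reasonable under one metric and disastrous under another, which is exactly why single-metric comparisons across papers are fragile.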

Horizons. A model evaluated at a 24-step horizon may rank differently than the same model at a 96-step or 720-step horizon. Short-horizon accuracy depends heavily on recent local patterns, while long-horizon accuracy depends on capturing trend and seasonality. Papers that only report one horizon give an incomplete picture.

Splitting strategies. Some evaluations use a single fixed train/test split. Others use rolling-origin evaluation, where the model is tested on multiple consecutive windows. Rolling evaluation is more robust but computationally expensive, so it is not universally adopted.
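The difference between a single split and rolling-origin evaluation can be sketched in a few lines; the index generator below is a simplified illustration, not any particular library's API:

```python
# Minimal sketch of rolling-origin evaluation: the forecast origin
# advances through the series and the model is scored on each window.

def rolling_origin_splits(n, horizon, initial, step=1):
    """Yield (train_end, test_indices) pairs for rolling evaluation."""
    origin = initial
    while origin + horizon <= n:
        yield origin, range(origin, origin + horizon)
        origin += step

series = list(range(20))  # toy series of length 20
splits = list(rolling_origin_splits(len(series), horizon=4, initial=12, step=4))
print(splits)  # [(12, range(12, 16)), (16, range(16, 20))]
```

A single fixed split is just the degenerate case with one window, which is why its results carry much higher variance: one unlucky test window can flip a ranking.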

Issue 3: Cherry-Picking and Reporting Bias

This is a human problem more than a technical one. Researchers naturally emphasize results where their model performs best. A paper might evaluate on 20 datasets but highlight the 8 where the proposed model ranks first. Supplementary tables contain the full picture, but readers often do not examine them closely.

Aggregation methods also matter. Reporting the mean rank across datasets gives a different picture than reporting the mean normalized error. A model that is slightly worse on most datasets but catastrophically bad on one will look fine by mean rank but terrible by mean error. The choice of aggregation is rarely discussed but can significantly influence which model appears to lead.
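A toy example with hypothetical numbers makes the aggregation effect visible: Model B is marginally better on most datasets but fails badly on one, and the two aggregation schemes crown different winners:

```python
# Hypothetical normalized errors per dataset for two models.
errors = {
    "A": [1.00, 1.00, 1.00, 1.00],
    "B": [0.95, 0.95, 0.95, 5.00],  # slightly better, except one blow-up
}

mean_error = {m: sum(e) / len(e) for m, e in errors.items()}

def mean_rank(errors):
    """Average per-dataset rank (1 = best) across all datasets."""
    models = list(errors)
    n = len(next(iter(errors.values())))
    ranks = {m: 0.0 for m in models}
    for i in range(n):
        ordered = sorted(models, key=lambda m: errors[m][i])
        for r, m in enumerate(ordered, start=1):
            ranks[m] += r / n
    return ranks

print(mean_error)        # A wins by mean error (B's blow-up dominates)
print(mean_rank(errors)) # B wins by mean rank: A=1.75, B=1.25
```

Neither aggregation is wrong; they answer different questions. Mean rank asks "which model is more often better", mean error asks "which model costs less on average", and a paper should say which question it is answering.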

Issue 4: The Gap Between Academic and Real-World Data

Most forecasting benchmarks are curated academic datasets. They tend to be clean, regularly sampled, and well-behaved. Real-world time series data is messier: missing values are common, sampling intervals drift, distribution shifts occur when business conditions change, and metadata about frequency or domain is often absent or incorrect.

A model that excels on clean benchmark data may struggle when confronted with a retail sales series that has random gaps during holidays, a sensor stream that switches from 5-minute to 15-minute sampling partway through, or a financial series that undergoes a regime change after a regulatory shift. Very few benchmarks test for this kind of robustness.
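One way to probe this gap, sketched below with hypothetical helper functions, is to perturb a clean benchmark series the way real data degrades and measure how much a model's error grows:

```python
import random

def inject_gaps(series, frac=0.1, seed=0):
    """Replace a random fraction of points with None (missing values)."""
    rng = random.Random(seed)
    out = list(series)
    for i in rng.sample(range(len(out)), int(frac * len(out))):
        out[i] = None
    return out

def coarsen(series, factor=3):
    """Simulate a sampling-interval change by keeping every k-th point."""
    return series[::factor]

clean = [float(i % 7) for i in range(60)]  # toy weekly-seasonal series
print(sum(v is None for v in inject_gaps(clean, frac=0.1)))  # 6 missing points
print(len(coarsen(clean, factor=3)))                         # 20 points
```

Comparing a model's error on the clean series against the perturbed variants gives a crude but informative robustness profile, which most published benchmarks simply never report.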

What Good Benchmarking Looks Like

The field is making progress toward more rigorous evaluation. Several principles are emerging:

Held-out evaluation sets. Benchmark suites like GIFT-Eval, introduced in 2024, explicitly construct evaluation datasets that are unlikely to appear in any model's pretraining corpus. This addresses the contamination problem directly.

Standardized preprocessing and evaluation code. Rather than each paper implementing its own evaluation pipeline, shared codebases ensure that metrics are computed identically. The GluonTS evaluation module and the Nixtla benchmarking tools are steps in this direction.

Multiple metrics and horizons. Responsible evaluation reports results across several metrics (at minimum, a point accuracy metric and a probabilistic metric) and multiple forecast horizons. Single-number summaries are convenient but misleading.

Statistical significance testing. Reporting that Model A achieves MASE of 0.82 versus Model B's 0.84 is meaningless without knowing the variance. Nemenyi tests, Wilcoxon signed-rank tests, or bootstrap confidence intervals on rank differences help distinguish genuine improvements from noise.
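A percentile bootstrap on paired per-dataset error differences needs only the standard library; the numbers below are toy values, not real model results:

```python
import random

def bootstrap_ci(diffs, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired differences."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choice(diffs) for _ in diffs) / len(diffs)
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# MASE(A) - MASE(B) on ten datasets; negative means A is better.
diffs = [-0.02, -0.01, 0.01, -0.03, 0.00, -0.02, 0.02, -0.01, -0.02, -0.01]
lo, hi = bootstrap_ci(diffs)
print(round(lo, 4), round(hi, 4))  # if the CI spans zero, the gap may be noise
```

The same paired differences can also be fed to `scipy.stats.wilcoxon` for a signed-rank test; the bootstrap version is shown here only because it is self-contained.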

Domain-stratified analysis. A model's aggregate performance can mask large domain-specific differences. Reporting results broken down by domain (retail, energy, finance, etc.) helps practitioners assess relevance to their use case.
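The stratified report itself is trivial to produce once per-dataset results are tagged with a domain; a minimal sketch with made-up numbers:

```python
from collections import defaultdict

results = [  # (domain, dataset, normalized error) -- hypothetical
    ("retail", "d1", 0.80), ("retail", "d2", 0.85),
    ("energy", "d3", 1.20), ("energy", "d4", 1.10),
    ("finance", "d5", 0.95),
]

by_domain = defaultdict(list)
for domain, _, err in results:
    by_domain[domain].append(err)

for domain, errs in sorted(by_domain.items()):
    print(f"{domain:8s} mean error {sum(errs) / len(errs):.3f}")
```

A global mean over these five datasets would hide the fact that this hypothetical model is strong on retail and weak on energy, which is precisely the information a practitioner choosing a model needs.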

The Monash Archive and Beyond

The Monash Forecasting Archive has served as the closest thing to a standard benchmark in time series forecasting, collecting datasets across multiple frequencies and domains. Its limitation is that many TSFMs now include Monash data in their pretraining, complicating zero-shot evaluation.

Newer benchmark efforts aim to address this. GIFT-Eval provides a curated, contamination-aware benchmark suite. TSBench focuses on systematic comparison under controlled conditions. As these efforts mature, the field will move toward more reliable model comparisons.

How TSFM.ai Approaches Evaluation

Internally, we maintain a rolling evaluation suite that tests all supported models on a combination of public benchmarks (with contamination status tracked) and proprietary held-out datasets from anonymized customer data. We evaluate across multiple horizons, report both point and probabilistic metrics, and flag cases where model rankings diverge significantly by domain. This is not to determine a single "best" model, but to build a routing heuristic that matches each forecasting problem to the model most likely to serve it well. The goal is practical accuracy, not leaderboard positions.
