GIFT-Eval
How well does a model generalize zero-shot to unseen forecasting series?
GIFT-Eval expands the question beyond raw point accuracy. It looks at how models perform across many datasets, frequencies, and forecasting settings, making it a strong benchmark for teams that care about robustness rather than a single flattering leaderboard slice.
What this benchmark answers
Which models stay strong across heterogeneous datasets and probabilistic settings?
Methodology
Models are scored on grouped benchmark slices and ranked by average rank, with Weighted Quantile Loss providing a secondary read on probabilistic accuracy.
patch_tst ranks first with an Average Rank of 5.65, followed by moirai_1.1_R_large_no_leak (6.06) and i_transformer (6.25).
6 of 12 ranked models hosted on TSFM.ai · Lower is better
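To make the aggregation concrete, here is a minimal sketch of how an "average rank" leaderboard metric is typically computed: rank every model within each benchmark slice by its score, then average each model's placements. The model names and scores below are invented for illustration, not real GIFT-Eval results.

```python
# Illustrative sketch of average-rank aggregation across benchmark slices.
# Lower score = better within a slice; lower average rank = better overall.

def average_ranks(scores_by_slice):
    """scores_by_slice: {slice_name: {model: score}} with lower score better.
    Returns {model: mean rank across all slices the model appears in}."""
    rank_sums, counts = {}, {}
    for slice_scores in scores_by_slice.values():
        ordered = sorted(slice_scores, key=slice_scores.get)  # best score first
        for rank, model in enumerate(ordered, start=1):
            rank_sums[model] = rank_sums.get(model, 0) + rank
            counts[model] = counts.get(model, 0) + 1
    return {m: rank_sums[m] / counts[m] for m in rank_sums}

# Hypothetical scores on three slices (e.g. per-slice WQL):
scores = {
    "hourly":  {"model_a": 0.10, "model_b": 0.12, "model_c": 0.30},
    "daily":   {"model_a": 0.25, "model_b": 0.20, "model_c": 0.28},
    "monthly": {"model_a": 0.15, "model_b": 0.40, "model_c": 0.22},
}
print(average_ranks(scores))
# model_a places 1st, 2nd, 1st -> average rank 1.33; it wins overall
# even though it does not top every slice.
```

This is why a model can lead GIFT-Eval without winning any single slice outright: consistency across heterogeneous slices is what the metric rewards.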
Rankings

[Chart: Average Rank vs Average WQL]

Full results
| # | Model | Average Rank |
|---|---|---|
| 1 | patch_tst | 5.65 |
| 2 | moirai_1.1_R_large_no_leak | 6.06 |
| 3 | i_transformer | 6.25 |
| 4 | tft | 6.78 |
| 5 | | 7.44 |
| 6 | | 7.66 |
| 7 | chronos_base | 8.43 |
| 8 | | 8.44 |
| 9 | | 8.70 |
| 10 | chronos_small | 9.20 |
| 11 | tide | 11.55 |
| 12 | deepar | 12.00 |
Why robustness across datasets matters
A model that tops one leaderboard can collapse on a different frequency or domain. GIFT-Eval forces models to prove themselves across 23 grouped dataset slices covering different frequencies, horizons, and series shapes. If a model ranks well here, you can be more confident it will not surprise you when your data does not look like the training distribution.
Understanding Weighted Quantile Loss
WQL penalizes both overconfident and underconfident prediction intervals. A model with a low WQL produces forecast distributions that are well-calibrated — the 90th percentile prediction actually lands above the true value about 90% of the time. This matters for capacity planning, inventory, and any decision that depends on reliable uncertainty estimates rather than just the median forecast.
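The penalty WQL applies can be sketched with the standard quantile (pinball) loss used by probabilistic forecasting benchmarks: for quantile q, under-prediction costs q times the miss and over-prediction costs (1 − q) times the miss, summed over quantiles and normalized by the scale of the actuals. The exact quantile set and weighting GIFT-Eval uses may differ; this is an illustrative sketch with made-up numbers.

```python
# Sketch of a weighted quantile loss, using the standard pinball-loss form:
#   QL_q(y, yhat) = q * (y - yhat)       if y >= yhat   (under-prediction)
#                 = (1 - q) * (yhat - y) otherwise      (over-prediction)
#   WQL = 2 * sum of QL over quantiles and time steps / sum of |y|

def quantile_loss(y, yhat, q):
    return q * (y - yhat) if y >= yhat else (1 - q) * (yhat - y)

def wql(actuals, quantile_forecasts):
    """actuals: list of true values.
    quantile_forecasts: {q: [forecast for each step at quantile q]}."""
    total = sum(
        quantile_loss(y, preds[t], q)
        for q, preds in quantile_forecasts.items()
        for t, y in enumerate(actuals)
    )
    scale = sum(abs(y) for y in actuals)  # normalizes across series scales
    return 2 * total / scale

# Hypothetical 3-step forecast at the 10th, 50th, and 90th percentiles:
actuals = [100.0, 120.0, 90.0]
forecasts = {
    0.1: [80.0, 95.0, 70.0],
    0.5: [100.0, 115.0, 92.0],
    0.9: [130.0, 140.0, 110.0],
}
print(round(wql(actuals, forecasts), 4))  # -> 0.1097
```

Note the asymmetry: at q = 0.9 the loss is small when the true value falls below the forecast, so a well-calibrated 90th-percentile band should sit above the actuals about 90% of the time, exactly the calibration property described above.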
How to interpret it
- Lower Average Rank is better because the leaderboard aggregates placement across many slices.
- Consult WQL when forecast calibration and uncertainty quality matter to the business.
- Use GIFT-Eval to sanity-check whether a model is robust beyond a single narrow domain.
Frequently asked questions
- What is GIFT-Eval?
- GIFT-Eval is a probabilistic forecasting benchmark from Salesforce that evaluates time series foundation models across 23 diverse dataset groups. It ranks models by Average Rank and Average Weighted Quantile Loss.
- What does Average Rank mean in GIFT-Eval?
- Average Rank is the mean placement of a model across all benchmark slices. A lower value means the model consistently finishes near the top across heterogeneous datasets rather than dominating only one slice.
- When should I use GIFT-Eval instead of FEV Bench?
- Use GIFT-Eval when you care about probabilistic forecast quality (calibrated uncertainty) and robustness across many different data domains, rather than just point-forecast accuracy on a single leaderboard.
- Does GIFT-Eval test multivariate forecasting?
- The TSFM.ai surface currently tracks the univariate grouping from GIFT-Eval. The benchmark also includes multivariate slices upstream; multivariate coverage may be added in a future update.
Related reading
Compare with other TSFM benchmarks
How do models behave on observability telemetry instead of academic datasets?
Does model performance hold up as real time passes and the data distribution shifts?