How well does a model generalize zero-shot to unseen time series?
GIFT-Eval Leaderboard
GIFT-Eval expands the question beyond raw point accuracy. It looks at how models perform across many datasets, frequencies, and forecasting settings, making it a strong benchmark for teams that care about robustness rather than a single flattering leaderboard slice. This page mirrors the official Salesforce GIFT-Eval leaderboard with search, filters, and per-slice rankings — covering every submitted model, not just the top few — and links each result back to a hosted endpoint on TSFM.ai where available.
TSOrchestra ranks first with an average MASE rank of 11.72 across the GIFT-Eval dataset slices.
What this benchmark answers
Which models stay strong across heterogeneous datasets and probabilistic settings?
Methodology
Models are scored on grouped benchmark slices and ranked by average rank, with Weighted Quantile Loss providing a secondary read on probabilistic accuracy.
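A minimal sketch of that aggregation, assuming per-slice MASE scores are already available in a table (the model names, slice names, and scores below are illustrative placeholders, not real leaderboard data):

```python
import pandas as pd

# Hypothetical per-slice MASE scores: one row per model, one column
# per GIFT-Eval dataset slice (lower MASE is better).
mase = pd.DataFrame(
    {
        "slice_hourly": [0.81, 0.94, 1.10],
        "slice_daily": [0.72, 0.69, 0.88],
        "slice_weekly": [1.35, 1.20, 1.42],
    },
    index=["model_a", "model_b", "model_c"],
)

# Rank models within each slice (1 = best MASE on that slice), then
# average those placements to get the leaderboard's overall rank score.
per_slice_rank = mase.rank(axis=0, method="average")
average_rank = per_slice_rank.mean(axis=1).sort_values()
print(average_rank)  # lowest average rank first: model_b, model_a, model_c
```

Ranking within each slice before averaging is what makes the score robust: a model cannot buy a good overall position by winning one slice by a huge margin.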
GIFT-Eval leaderboard
Showing 77 of 77 models · 34 hosted on TSFM.ai · Sorted by overall MASE rank (lower is better)
| # | Model | MASE rank |
|---|---|---|
| 1 | TSOrchestra | 11.72 |
| 2 | DeOSAlphaTimeGPTPredictor-2025HF vencortex® | 12.03 |
| 3 | | 13.00 |
| 4 | | 13.90 |
| 5 | | 14.07 |
| 6 | Samay Kairosity | 16.11 |
| 7 | ZooCast-Top1Code | 16.38 |
| 8 | | 17.04 |
| 9 | Migas-1.0HF Synthefy | 17.09 |
| 10 | SynapseHF Google Cloud AI Research | 17.13 |
| 11 | | 18.84 |
| 12 | TSOrchestra-testHF Melady Lab @ USC | 20.38 |
| 13 | IBM TSFM & Rensselaer Polytechnic Institute | 21.70 |
| 14 | | 21.81 |
| 15 | IBM TSFM | 21.93 |
| 16 | Google Research | 22.06 |
| 17 | LongSeer-v1.0 LongShine AI Research | 22.94 |
| 18 | | 24.06 |
| 19 | Reverso MIT | 25.03 |
| 20 | IBM TSFM & Rensselaer Polytechnic Institute | 25.09 |
| 21 | Xihe-ultraHF Ant | 25.54 |
| 22 | | 26.18 |
| 23 | | 26.21 |
| 24 | VISIT-2.0 | 27.16 |
| 25 | Xihe-maxHF Ant | 28.00 |
| 26 | | 29.08 |
| 27 | | 29.18 |
| 28 | | 29.38 |
| 29 | A ShanghaiTech University | 30.93 |
| 30 | IBM Research | 31.38 |
| 31 | | 31.81 |
| 32 | TEMPO_ENSEMBLEHF Melady Lab @ USC | 33.28 |
| 33 | Datadog | 33.37 |
| 34 | A ShanghaiTech University | 33.70 |
| 35 | A ShanghaiTech University | 34.13 |
| 36 | IBM Research | 34.77 |
| 37 | AWS AI Labs | 35.00 |
| 38 | Google Research | 36.89 |
| 39 | Tsinghua University | 37.71 |
| 40 | Reverso-Nano MIT | 38.85 |
| 41 | | 39.33 |
| 42 | xLSTM-MixerHF AIML Lab @ TU Darmstadt | 39.44 |
| 43 | AWS AI Labs | 40.23 |
| 44 | Alibaba | 40.86 |
| 45 | Alibaba | 41.71 |
| 46 | | 41.97 |
| 47 | | 42.42 |
| 48 | Alibaba | 44.22 |
| 49 | Salesforce AI Research | 46.36 |
| 50 | Salesforce AI Research | 46.96 |
| 51 | Lingjiang Alibaba Cloud | 47.61 |
| 52 | AWS AI Labs | 48.66 |
| 53 | | 48.88 |
| 54 | Princeton University | 50.99 |
| 55 | | 51.62 |
| 56 | i_transformer Tsinghua University | 52.45 |
| 57 | Alibaba | 52.49 |
| 58 | | 54.90 |
| 59 | TFT Google Research | 54.98 |
| 60 | VisionTSHF Zhejiang University | 55.91 |
| 61 | Salesforce AI Research | 56.43 |
| 62 | N-BEATS ServiceNow | 57.16 |
| 63 | | 57.63 |
| 64 | FLAIR Mellon Inc. | 58.57 |
| 65 | Auto_Arima | 58.90 |
| 66 | IBM Research | 59.75 |
| 67 | IBM Research | 61.75 |
| 68 | | 62.18 |
| 69 | Auto_ETS | 62.48 |
| 70 | Auto_Theta | 62.75 |
| 71 | DLinear The Chinese University of Hong Kong | 62.92 |
| 72 | TIDE Google Research | 64.15 |
| 73 | DeepAR Amazon Research | 64.30 |
| 74 | Crossformer Shanghai Jiao Tong University | 64.60 |
| 75 | | 67.07 |
| 76 | | 67.66 |
| 77 | VISIT-1.0 | 67.84 |
Model landscape
GIFT-Eval is not only a TSFM leaderboard — it includes classical statistical baselines (ARIMA, ETS, Theta), deep-learning architectures (PatchTST, iTransformer, TFT), and agentic systems, so foundation-model results can be compared against the full prior art.
- Pretrained: 17 (22%)
- Zero-shot: 32 (42%)
- Fine-tuned: 4 (5%)
- Agentic: 8 (10%)
- Deep learning: 10 (13%)
- Statistical: 6 (8%)
Why robustness across datasets matters
A model that tops one leaderboard can collapse on a different frequency or domain. GIFT-Eval forces models to prove themselves across 23 grouped dataset slices covering different frequencies, horizons, and series shapes. If a model ranks well here, you can be more confident it will not surprise you when your data does not look like the training distribution.
Understanding Weighted Quantile Loss
WQL penalizes both overconfident and underconfident prediction intervals. A model with a low WQL produces forecast distributions that are well-calibrated — the 90th percentile prediction actually lands above the true value about 90% of the time. This matters for capacity planning, inventory, and any decision that depends on reliable uncertainty estimates rather than just the median forecast.
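As a concrete reference, here is a hedged sketch of how a WQL-style metric is commonly computed in the probabilistic-forecasting literature: a pinball loss per quantile level, summed and normalized by the scale of the target. It illustrates the idea rather than reproducing the exact upstream evaluation code.

```python
import numpy as np

def weighted_quantile_loss(y_true, y_pred_quantiles, quantiles):
    """Mean pinball loss across quantile levels, scaled by 2 and
    normalized by the total absolute value of the target series.

    y_true: shape (T,) actual values
    y_pred_quantiles: shape (Q, T), one row of predictions per level
    quantiles: shape (Q,) quantile levels, e.g. [0.1, 0.5, 0.9]
    """
    y_true = np.asarray(y_true, dtype=float)
    total = 0.0
    for q, y_q in zip(quantiles, np.asarray(y_pred_quantiles, dtype=float)):
        diff = y_true - y_q
        # Pinball loss: under-prediction is penalized with weight q,
        # over-prediction with weight (1 - q).
        total += np.sum(np.maximum(q * diff, (q - 1) * diff))
    return 2 * total / (len(quantiles) * np.sum(np.abs(y_true)))

# Toy example: three quantile forecasts over a four-step horizon.
y = [10.0, 12.0, 9.0, 11.0]
preds = [
    [8.0, 10.0, 7.5, 9.0],     # q = 0.1
    [10.0, 12.5, 9.0, 11.0],   # q = 0.5
    [12.0, 14.0, 11.0, 13.0],  # q = 0.9
]
print(weighted_quantile_loss(y, preds, [0.1, 0.5, 0.9]))
```

The asymmetric weighting is what rewards calibration: a model keeps the metric low only by placing its quantile bands tightly around the values that actually occur.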
How to interpret it
- Lower Average Rank is better because the leaderboard aggregates placement across many slices.
- WQL matters when forecast calibration and uncertainty quality matter to the business.
- Use GIFT-Eval to sanity-check whether a model is robust beyond a single narrow domain.
Frequently asked questions
- What is GIFT-Eval?
- GIFT-Eval (General Time Series Forecasting Model Evaluation) is a probabilistic forecasting benchmark from Salesforce that evaluates time series foundation models across 23 diverse dataset groups spanning 7 domains and 10 frequencies. Models are ranked by Average Rank and Average Weighted Quantile Loss.
- What does Average Rank mean in GIFT-Eval?
- Average Rank is the mean placement of a model across every benchmark slice. For example, a model that finishes 3rd, 7th, and 2nd on three slices has an Average Rank of 4.0. A lower value means the model consistently finishes near the top across heterogeneous datasets rather than dominating only one slice.
- When should I use GIFT-Eval instead of FEV Bench?
- Use GIFT-Eval when you care about probabilistic forecast quality (calibrated uncertainty) and robustness across many different data domains, rather than just point-forecast accuracy on a single leaderboard.
- Does GIFT-Eval test multivariate forecasting?
- GIFT-Eval includes both univariate and multivariate slices. The leaderboard on TSFM.ai aggregates both variate types from the upstream grouped-by-univariate file, and you can filter per-slice rankings to inspect multivariate-only performance.
- What is the 'test leak' column?
- Some GIFT-Eval submissions are from models whose training data overlaps with the evaluation datasets. A 'Yes' in the test-leak column means the authors disclosed partial or full pretraining-data overlap; use the filter to compare only models with no known test-data leakage.
- What model types appear on the leaderboard?
- GIFT-Eval ranks pretrained foundation models, zero-shot models, fine-tuned models, agentic systems, classical deep-learning architectures (PatchTST, iTransformer, TFT, TiDE, N-BEATS), and statistical baselines (ARIMA, ETS, Theta). The page shows every type so you can benchmark a TSFM against classical prior art.
- How often is the leaderboard refreshed?
- TSFM.ai refetches the upstream Salesforce GIFT-Eval results every 12 hours via Next.js ISR. The 'last refreshed' timestamp at the bottom of the leaderboard reflects the most recent successful refresh.
Related reading
Compare with other TSFM benchmarks
How do models behave on observability telemetry instead of academic datasets?
Does model performance hold up as real time passes and the data distribution shifts?