How well does a model generalize to unseen real-world forecasting tasks?
GIFT-Eval Leaderboard
GIFT-Eval expands the question beyond raw point accuracy. It looks at how models perform across many datasets, frequencies, and forecasting settings, making it a strong benchmark for teams that care about robustness rather than a single flattering leaderboard slice. This page mirrors the official Salesforce GIFT-Eval leaderboard with search, filters, and per-slice rankings. It covers every submitted model, not just the top few, and links each result back to a hosted endpoint on TSFM.ai where available.
Cobra-Agent ranks first with an average MASE rank of 14.63 across the GIFT-Eval dataset slices.
What this benchmark answers
Which models stay strong across heterogeneous datasets and probabilistic settings?
Methodology
Models are scored on grouped benchmark slices and ranked by average rank, with Weighted Quantile Loss providing a secondary read on probabilistic accuracy.
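The average-rank idea can be sketched in a few lines of Python. This is an illustrative sketch only: the slice names, model names, and error values below are hypothetical, ties are broken by sort order rather than shared ranks, and the upstream leaderboard may group slices and handle ties differently.

```python
from statistics import mean


def average_ranks(scores):
    """Return {model: mean rank across slices}.

    scores maps slice name -> {model name: error}, where lower error is better.
    Ties are not given shared ranks here (an assumption of this sketch).
    """
    ranks = {}
    for slice_scores in scores.values():
        # Rank 1 goes to the model with the lowest error on this slice.
        ordered = sorted(slice_scores, key=slice_scores.get)
        for position, model in enumerate(ordered, start=1):
            ranks.setdefault(model, []).append(position)
    return {model: mean(positions) for model, positions in ranks.items()}


# Hypothetical per-slice errors (for example, MASE); not real leaderboard data.
scores = {
    "hourly/short": {"model_a": 0.80, "model_b": 0.90, "model_c": 1.10},
    "daily/short": {"model_a": 0.95, "model_b": 0.85, "model_c": 1.00},
    "weekly/long": {"model_a": 0.70, "model_b": 0.75, "model_c": 0.60},
}

avg = average_ranks(scores)
best = min(avg, key=avg.get)
```

In this toy data `model_c` wins one slice outright but finishes last on the other two, so its average rank is worse than that of `model_a`, which never finishes below second. That is the behavior an average-rank metric is meant to reward.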
GIFT-Eval leaderboard
Showing 94 of 94 models · 37 hosted on TSFM.ai · Sorted by overall MASE rank (lower is better)
| # | Model | MASE rank |
|---|---|---|
| 1 | Cobra-Agent Dalpha AI | 14.63 |
| 2 | | 16.45 |
| 3 | | 16.62 |
| 4 | | 17.58 |
| 5 | | 18.29 |
| 6 | | 18.41 |
| 7 | DeOSAlphaTimeGPTPredictor-2025 vencortex® | 19.03 |
| 8 | | 20.18 |
| 9 | | 20.34 |
| 10 | | 21.90 |
| 11 | | 22.02 |
| 12 | Samay Kairosity | 24.41 |
| 13 | | 25.40 |
| 14 | ZooCast-Top1 | 25.46 |
| 15 | Migas-1.0 Synthefy | 25.67 |
| 16 | STRIDE (+Chronos-2) Google Cloud AI Research | 25.75 |
| 17 | Synapse Google Cloud AI Research | 26.01 |
| 18 | | 26.01 |
| 19 | STRIDE (+Timer-S1) Google Cloud AI Research | 26.41 |
| 20 | Datadog | 26.70 |
| 21 | Datadog | 27.68 |
| 22 | | 28.33 |
| 23 | Prism Birla AI Labs | 29.68 |
| 24 | | 30.09 |
| 25 | TSOrchestra-test Melady Lab @ USC | 30.18 |
| 26 | IBM TSFM & Rensselaer Polytechnic Institute | 31.36 |
| 27 | | 31.38 |
| 28 | | 31.56 |
| 29 | Google Research | 32.14 |
| 30 | IBM TSFM | 32.22 |
| 31 | LongSeer-v1.0 LongShine AI Research | 33.41 |
| 32 | | 35.26 |
| 33 | | 35.27 |
| 34 | IBM TSFM & Rensselaer Polytechnic Institute | 36.06 |
| 35 | Reverso MIT | 36.13 |
| 36 | Datadog | 36.39 |
| 37 | Xihe-ultra Ant | 37.19 |
| 38 | | 37.47 |
| 39 | | 37.78 |
| 40 | VISIT-2.0 | 38.56 |
| 41 | Xihe-max Ant | 39.91 |
| 42 | | 41.21 |
| 43 | | 41.42 |
| 44 | | 41.60 |
| 45 | TEMPO_ENSEMBLE Melady Lab @ USC | 42.39 |
| 46 | A ShanghaiTech University | 43.10 |
| 47 | | 44.79 |
| 48 | Datadog | 46.44 |
| 49 | A ShanghaiTech University | 46.80 |
| 50 | A ShanghaiTech University | 47.30 |
| 51 | IBM Research | 47.64 |
| 52 | AWS AI Labs | 47.85 |
| 53 | Datadog | 47.96 |
| 54 | Google Research | 49.49 |
| 55 | Tsinghua University | 51.08 |
| 56 | xLSTM-Mixer AIML Lab @ TU Darmstadt | 52.87 |
| 57 | Reverso-Nano MIT | 53.15 |
| 58 | | 53.38 |
| 59 | AWS AI Labs | 53.97 |
| 60 | Alibaba | 55.31 |
| 61 | Alibaba | 56.09 |
| 62 | | 56.40 |
| 63 | | 56.51 |
| 64 | Alibaba | 58.95 |
| 65 | Salesforce AI Research | 60.65 |
| 66 | Lingjiang Alibaba Cloud | 60.92 |
| 67 | Salesforce AI Research | 61.34 |
| 68 | AWS AI Labs | 63.31 |
| 69 | | 63.73 |
| 70 | Princeton University | 66.22 |
| 71 | | 66.37 |
| 72 | | 66.77 |
| 73 | i_transformer Tsinghua University | 67.88 |
| 74 | Alibaba | 68.30 |
| 75 | | 68.81 |
| 76 | | 70.32 |
| 77 | TFT Google Research | 70.71 |
| 78 | VisionTS Zhejiang University | 72.09 |
| 79 | Salesforce AI Research | 72.22 |
| 80 | N-BEATS ServiceNow | 73.16 |
| 81 | | 73.55 |
| 82 | Auto_Arima | 74.96 |
| 83 | IBM Research | 75.94 |
| 84 | IBM Research | 77.85 |
| 85 | Auto_ETS | 78.36 |
| 86 | Auto_Theta | 78.58 |
| 87 | | 78.63 |
| 88 | DLinear The Chinese University of Hong Kong | 79.48 |
| 89 | Crossformer Shanghai Jiao Tong University | 79.87 |
| 90 | DeepAR Amazon Research | 80.73 |
| 91 | TIDE Google Research | 80.76 |
| 92 | | 83.85 |
| 93 | | 84.22 |
| 94 | VISIT-1.0 | 84.39 |
Model landscape
GIFT-Eval is not only a TSFM leaderboard. It also includes classical statistical baselines (ARIMA, ETS, Theta), deep-learning architectures (PatchTST, iTransformer, TFT), and agentic systems, so foundation-model results can be compared against the full prior art.
- Pretrained: 26 (28%)
- Zero-shot: 32 (34%)
- Fine-tuned: 6 (6%)
- Agentic: 14 (15%)
- Deep learning: 10 (11%)
- Statistical: 6 (6%)
Why robustness across datasets matters
A model that tops one leaderboard can collapse on a different frequency or domain. GIFT-Eval forces models to prove themselves across 23 grouped dataset slices covering different frequencies, horizons, and series shapes. If a model ranks well here, you can be more confident it will not surprise you when your data does not look like the training distribution.
Understanding Weighted Quantile Loss
WQL penalizes both overconfident and underconfident prediction intervals. A model with a low WQL produces forecast distributions that are both well calibrated and sharp. Well calibrated means, for example, that the 90th-percentile prediction lands above the true value about 90% of the time. This matters for capacity planning, inventory, and any decision that depends on reliable uncertainty estimates rather than just the median forecast.
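To make the metric concrete, here is a minimal sketch of a weighted quantile loss built from the pinball (quantile) loss. It follows a commonly used normalization (twice the summed pinball loss divided by the summed absolute targets, averaged over quantile levels). The quantile levels, the toy data, and the function names are assumptions for illustration, and the exact aggregation used by the upstream benchmark may differ.

```python
def pinball_loss(y_true, y_pred, q):
    """Pinball (quantile) loss for one observation at quantile level q."""
    diff = y_true - y_pred
    # Under-prediction is weighted by q, over-prediction by (1 - q).
    return q * diff if diff >= 0 else (q - 1) * diff


def weighted_quantile_loss(y_true, quantile_forecasts):
    """Mean over quantile levels of 2 * sum(pinball) / sum(|y_true|).

    quantile_forecasts maps a quantile level q to a list of forecasts,
    aligned with y_true. Assumes sum(|y_true|) is non-zero.
    """
    scale = sum(abs(y) for y in y_true)
    per_level = []
    for q, preds in quantile_forecasts.items():
        total = sum(pinball_loss(y, p, q) for y, p in zip(y_true, preds))
        per_level.append(2 * total / scale)
    return sum(per_level) / len(per_level)


# Hypothetical targets and quantile forecasts; not real benchmark data.
y_true = [10.0, 12.0, 8.0]
forecasts = {
    0.1: [8.0, 9.0, 7.0],
    0.5: [10.0, 11.0, 9.0],
    0.9: [13.0, 14.0, 10.0],
}

wql = weighted_quantile_loss(y_true, forecasts)
```

The asymmetry in `pinball_loss` is what ties the score to calibration: at `q = 0.9`, a forecast that falls below the true value costs nine times as much per unit of error as one that falls above it, so the loss is minimized only when the forecast sits near the true 90th percentile.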
How to interpret it
- Lower Average Rank is better because the leaderboard aggregates placement across many slices.
- WQL matters when forecast calibration and uncertainty quality matter to the business.
- Use GIFT-Eval to sanity-check whether a model is robust beyond a single narrow domain.
Frequently asked questions
- What is GIFT-Eval?
- GIFT-Eval (General Time Series Forecasting Model Evaluation) is a probabilistic forecasting benchmark from Salesforce that evaluates time series foundation models across 23 diverse dataset groups spanning 7 domains and 10 frequencies. Models are ranked by Average Rank and Average Weighted Quantile Loss.
- What does Average Rank mean in GIFT-Eval?
- Average Rank is the mean placement of a model across every benchmark slice. A lower value means the model consistently finishes near the top across heterogeneous datasets rather than dominating only one slice.
- When should I use GIFT-Eval instead of FEV Bench?
- Use GIFT-Eval when you care about probabilistic forecast quality (calibrated uncertainty) and robustness across many different data domains, rather than just point-forecast accuracy on a single leaderboard.
- Does GIFT-Eval test multivariate forecasting?
- GIFT-Eval includes both univariate and multivariate slices. The leaderboard on TSFM.ai aggregates both variate types from the upstream grouped-by-univariate file, and you can filter per-slice rankings to inspect multivariate-only performance.
- What is the 'test leak' column?
- Some GIFT-Eval submissions are from models whose training data overlaps with the evaluation datasets. A 'Yes' in the test-leak column means the authors disclosed partial or full pretraining-data overlap; use the filter to compare only models with no known test-data leakage.
- What model types appear on the leaderboard?
- GIFT-Eval ranks pretrained foundation models, zero-shot models, fine-tuned models, agentic systems, classical deep-learning architectures (PatchTST, iTransformer, TFT, TiDE, N-BEATS), and statistical baselines (ARIMA, ETS, Theta). The page shows every type so you can benchmark a TSFM against classical prior art.
- How often is the leaderboard refreshed?
- TSFM.ai refetches the upstream Salesforce GIFT-Eval results every 12 hours via Next.js ISR. The 'last refreshed' timestamp at the bottom of the leaderboard reflects the most recent successful refresh.