Live · 77 models · 77 evaluated · Auto-refreshed from the official leaderboard every 12 hours

GIFT-Eval Leaderboard

GIFT-Eval expands the question beyond raw point accuracy. It looks at how models perform across many datasets, frequencies, and forecasting settings, making it a strong benchmark for teams that care about robustness rather than a single flattering leaderboard slice. This page mirrors the official Salesforce GIFT-Eval leaderboard with search, filters, and per-slice rankings — covering every submitted model, not just the top few — and links each result back to a hosted endpoint on TSFM.ai where available.

TSOrchestra ranks first with an average MASE rank of 11.72 across GIFT-Eval's grouped benchmark slices.

What this benchmark answers

Which models stay strong across heterogeneous datasets and probabilistic settings?

Methodology

Models are scored on grouped benchmark slices and ranked by average rank, with Weighted Quantile Loss providing a secondary read on probabilistic accuracy.
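As a minimal sketch of that aggregation, the Average Rank computation looks roughly like the TypeScript below. All model names, slice names, and scores here are illustrative placeholders, not real GIFT-Eval data, and ties are ignored for simplicity.

```ts
// Hypothetical sketch: rank models by MASE within each benchmark slice,
// then average those placements. Lower MASE = better slice rank.
type SliceScores = Record<string, number>; // slice name -> MASE

function averageRanks(
  models: Record<string, SliceScores>,
  slices: string[],
): Map<string, number> {
  const rankSums = new Map<string, number>();
  for (const slice of slices) {
    // Order models by MASE on this slice; index + 1 is the slice rank.
    const ordered = Object.entries(models).sort(
      (a, b) => a[1][slice] - b[1][slice],
    );
    ordered.forEach(([name], i) => {
      rankSums.set(name, (rankSums.get(name) ?? 0) + i + 1);
    });
  }
  return new Map(
    [...rankSums].map(([name, sum]) => [name, sum / slices.length]),
  );
}

// Example: each model wins one slice, so both end up with average rank 1.5.
const ranks = averageRanks(
  {
    modelA: { "electricity/H": 0.82, "traffic/D": 1.1 },
    modelB: { "electricity/H": 0.95, "traffic/D": 0.9 },
  },
  ["electricity/H", "traffic/D"],
);
console.log(ranks); // Map { 'modelA' => 1.5, 'modelB' => 1.5 }
```

This is why a model can lead the leaderboard without winning any single slice outright: consistent top-three finishes beat one dominant slice plus several poor ones.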

GIFT-Eval leaderboard

Showing 77 of 77 models · 34 hosted on TSFM.ai · Sorted by overall MASE rank (lower is better)

Rank · Model · Organization · MASE rank
(HF = Hugging Face link · Code = code link · — = not available)
1 · TSOrchestra (HF, Code) · Melady Lab @ USC · 11.72
2 · DeOSAlphaTimeGPTPredictor-2025 (HF) · vencortex® · 12.03
3 · MoiraiAgent-leaking (HF, Code) · Salesforce AI Research · 13.00
4 · MoiraiAgent (HF, Code) · Salesforce AI Research · 13.90
5 · Credence (HF, Code) · ContinualIST · 14.07
6 · Samay · Kairosity · 16.11
7 · ZooCast-Top1 (Code) · — · 16.38
8 · TimeCopilot (HF, Code) · — · 17.04
9 · Migas-1.0 (HF) · Synthefy · 17.09
10 · Synapse (HF) · Google Cloud AI Research · 17.13
11 · — · — · 18.84
12 · TSOrchestra-test (HF) · Melady Lab @ USC · 20.38
13 · — · IBM TSFM & Rensselaer Polytechnic Institute · 21.70
14 · — · Tsinghua & ByteDance · 21.81
15 · — · — · 21.93
16 · — · Google Research · 22.06
17 · LongSeer-v1.0 · LongShine AI Research · 22.94
18 · — · — · 24.06
19 · Reverso · MIT · 25.03
20 · — · IBM TSFM & Rensselaer Polytechnic Institute · 25.09
21 · Xihe-ultra (HF) · Ant · 25.54
22 · — · — · 26.18
23 · — · — · 26.21
24 · VISIT-2.0 · — · 27.16
25 · Xihe-max (HF) · Ant · 28.00
26 · — · — · 29.08
27 · Reverso-Small (HF, Code) · MIT · 29.18
28 · FlowState-9.1M (HF, Code) · IBM Research · 29.38
29 · — · ShanghaiTech University · 30.93
30 · — · — · 31.38
31 · — · Salesforce AI Research · 31.81
32 · TEMPO_ENSEMBLE (HF) · Melady Lab @ USC · 33.28
33 · — · — · 33.37
34 · — · ShanghaiTech University · 33.70
35 · — · ShanghaiTech University · 34.13
36 · — · — · 34.77
37 · — · — · 35.00
38 · — · Google Research · 36.89
39 · — · Tsinghua University · 37.71
40 · Reverso-Nano · MIT · 38.85
41 · CleanTS-65M (HF, Code) · Shandong University · 39.33
42 · xLSTM-Mixer (HF) · AIML Lab @ TU Darmstadt · 39.44
43 · — · — · 40.23
44 · — · — · 40.86
45 · — · — · 41.71
46 · TempoPFN (HF, Code) · University of Freiburg · 41.97
47 · TabPFN-TS (HF, Code) · PriorLabs · 42.42
48 · — · — · 44.22
49 · — · Salesforce AI Research · 46.36
50 · — · Salesforce AI Research · 46.96
51 · Lingjiang · Alibaba Cloud · 47.61
52 · — · — · 48.66
53 · Chronos_base (HF, Code) · AWS AI Labs · 48.88
54 · — · Princeton University · 50.99
55 · Chronos_small (HF, Code) · AWS AI Labs · 51.62
56 · i_transformer · Tsinghua University · 52.45
57 · — · — · 52.49
58 · — · Google Research · 54.90
59 · TFT · Google Research · 54.98
60 · VisionTS (HF) · Zhejiang University · 55.91
61 · — · Salesforce AI Research · 56.43
62 · N-BEATS · ServiceNow · 57.16
63 · — · IBM Research · 57.63
64 · FLAIR · Mellon Inc. · 58.57
65 · Auto_Arima · — · 58.90
66 · — · — · 59.75
67 · — · — · 61.75
68 · Seasonal_Naive (HF, Code) · — · 62.18
69 · Auto_ETS · — · 62.48
70 · Auto_Theta · — · 62.75
71 · DLinear · The Chinese University of Hong Kong · 62.92
72 · TiDE · Google Research · 64.15
73 · DeepAR · Amazon Research · 64.30
74 · Crossformer · Shanghai Jiao Tong University · 64.60
75 · — · Morgan Stanley & ServiceNow · 67.07
76 · Naive (HF, Code) · — · 67.66
77 · VISIT-1.0 · — · 67.84
77 models · Aggregated live from the official GIFT-Eval leaderboard · Refreshes every 12 hours · Last refreshed Apr 29, 2026, 4:09 PM

Model landscape

GIFT-Eval is not only a TSFM leaderboard — it includes classical statistical baselines (ARIMA, ETS, Theta), deep-learning architectures (PatchTST, iTransformer, TFT), and agentic systems, so foundation-model results can be compared against the full prior art.

  • Pretrained: 17 (22%)
  • Zero-shot: 32 (42%)
  • Fine-tuned: 4 (5%)
  • Agentic: 8 (10%)
  • Deep learning: 10 (13%)
  • Statistical: 6 (8%)

Why robustness across datasets matters

A model that tops one leaderboard can collapse on a different frequency or domain. GIFT-Eval forces models to prove themselves across 23 grouped dataset slices covering different frequencies, horizons, and series shapes. If a model ranks well here, you can be more confident it will not surprise you when your data does not look like the training distribution.

Understanding Weighted Quantile Loss

WQL penalizes both overconfident and underconfident prediction intervals. A model with a low WQL produces forecast distributions that are well-calibrated — the 90th percentile prediction actually lands above the true value about 90% of the time. This matters for capacity planning, inventory, and any decision that depends on reliable uncertainty estimates rather than just the median forecast.
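A rough sketch of the pinball (quantile) loss that underlies WQL is shown below; the exact normalization in the official GIFT-Eval harness may differ, and the quantile levels and forecast paths here are illustrative.

```ts
// Pinball (quantile) loss at level q: under-prediction is charged q,
// over-prediction is charged (1 - q). At q = 0.9, missing low is nine
// times as costly as missing high, which is what rewards calibration.
function pinball(actual: number, predicted: number, q: number): number {
  const diff = actual - predicted;
  return diff >= 0 ? q * diff : (q - 1) * diff;
}

// Weighted Quantile Loss over a horizon and a set of quantile levels,
// normalized by the sum of absolute actuals so series of different
// scales are comparable. Sketch only; conventions vary by harness.
function wql(
  actual: number[],
  quantileForecasts: Map<number, number[]>, // level -> forecast path
): number {
  let total = 0;
  for (const [q, path] of quantileForecasts) {
    for (let t = 0; t < actual.length; t++) {
      total += 2 * pinball(actual[t], path[t], q);
    }
  }
  const scale = actual.reduce((s, y) => s + Math.abs(y), 0);
  return total / (scale * quantileForecasts.size);
}

// A forecast whose 0.9 quantile sits just above the realized values scores
// far better than one that is wildly over- or under-confident.
const y = [10, 12, 11];
console.log(
  wql(
    y,
    new Map([
      [0.1, [8, 10, 9]],
      [0.5, [10, 12, 11]],
      [0.9, [12, 14, 13]],
    ]),
  ),
); // ~0.024
```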

How to interpret it

  • Lower Average Rank is better because the leaderboard aggregates placement across many slices.
  • WQL matters when forecast calibration and uncertainty quality matter to the business.
  • Use GIFT-Eval to sanity-check whether a model is robust beyond a single narrow domain.

Frequently asked questions

What is GIFT-Eval?
GIFT-Eval (General Time Series Forecasting Model Evaluation) is a probabilistic forecasting benchmark from Salesforce that evaluates time series foundation models across 23 diverse dataset groups spanning 7 domains and 10 frequencies. Models are ranked by Average Rank and Average Weighted Quantile Loss.
What does Average Rank mean in GIFT-Eval?
Average Rank is the mean placement of a model across every benchmark slice. A lower value means the model consistently finishes near the top across heterogeneous datasets rather than dominating only one slice.
When should I use GIFT-Eval instead of FEV Bench?
Use GIFT-Eval when you care about probabilistic forecast quality (calibrated uncertainty) and robustness across many different data domains, rather than just point-forecast accuracy on a single leaderboard.
Does GIFT-Eval test multivariate forecasting?
GIFT-Eval includes both univariate and multivariate slices. The leaderboard on TSFM.ai aggregates both variate types from the upstream grouped-by-univariate file, and you can filter per-slice rankings to inspect multivariate-only performance.
What is the 'test leak' column?
Some GIFT-Eval submissions are from models whose training data overlaps with the evaluation datasets. A 'Yes' in the test-leak column means the authors disclosed partial or full pretraining-data overlap; use the filter to compare only models with no known test-data leakage.
What model types appear on the leaderboard?
GIFT-Eval ranks pretrained foundation models, zero-shot models, fine-tuned models, agentic systems, classical deep-learning architectures (PatchTST, iTransformer, TFT, TiDE, N-BEATS), and statistical baselines (ARIMA, ETS, Theta). The page shows every type so you can benchmark a TSFM against classical prior art.
How often is the leaderboard refreshed?
TSFM.ai refetches the upstream Salesforce GIFT-Eval results every 12 hours via Next.js ISR. The 'last refreshed' timestamp at the bottom of the leaderboard reflects the most recent successful refresh.
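For context, a time-based ISR refresh in a Next.js App Router page looks roughly like the sketch below. The route path, component name, and UPSTREAM_URL are hypothetical placeholders, not TSFM.ai's actual implementation.

```ts
// app/leaderboards/gift-eval/page.tsx — illustrative 12-hour ISR setup.
// Next.js serves the cached page and regenerates it in the background
// once the revalidate window (in seconds) has expired.
export const revalidate = 43200; // 12 hours

const UPSTREAM_URL = "https://example.com/gift-eval/results.csv"; // placeholder

export default async function GiftEvalPage() {
  // The fetch result is cached alongside the page for the same window.
  const res = await fetch(UPSTREAM_URL, { next: { revalidate: 43200 } });
  const csv = await res.text();
  return <pre>{csv.slice(0, 500)}</pre>;
}
```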

Related reading

Compare with other TSFM benchmarks

FEV Bench

How well does a model generalize zero-shot to unseen forecasting series?

BOOM

How do models behave on observability telemetry instead of academic datasets?

Impermanent

Does model performance hold up as real time passes and the data distribution shifts?

Sources