Live · 94 models · 94 evaluated · Auto-refreshed from the official leaderboard every 12 hours

GIFT-Eval Leaderboard

GIFT-Eval asks more of a model than raw point accuracy. It looks at how models perform across many datasets, frequencies, and forecasting settings, making it a strong benchmark for teams that care about robustness rather than a single flattering leaderboard slice. This page mirrors the official Salesforce GIFT-Eval leaderboard with search, filters, and per-slice rankings. It covers every submitted model, not just the top few, and links each result back to a hosted endpoint on TSFM.ai where available.

Cobra-Agent ranks first with an average MASE rank of 14.63 across the GIFT-Eval dataset slices.

What this benchmark answers

Which models stay strong across heterogeneous datasets and probabilistic settings?

Methodology

Models are scored on each grouped benchmark slice and ordered by their average rank across those slices, with Weighted Quantile Loss providing a secondary read on probabilistic accuracy.
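
As a rough illustration of the average-rank idea, the sketch below ranks models within each slice by a lower-is-better metric and then averages those ranks per model. The function name, the slice and model names, and the MASE values are all made up for the example, and giving tied models the mean of their positions is an assumption; the official leaderboard may handle ties and missing slices differently.

```python
from collections import defaultdict


def average_ranks(scores):
    """scores: {slice_name: {model_name: metric_value}}, lower metric is better.

    Returns {model_name: mean rank over the slices the model appears in}.
    """
    rank_sums = defaultdict(float)
    counts = defaultdict(int)
    for slice_scores in scores.values():
        ordered = sorted(slice_scores.items(), key=lambda kv: kv[1])
        i = 0
        while i < len(ordered):
            # Find the run of models tied on the same metric value.
            j = i
            while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
                j += 1
            # Assumption: tied models share the mean of their 1-based positions.
            tied_rank = ((i + 1) + (j + 1)) / 2
            for model, _ in ordered[i:j + 1]:
                rank_sums[model] += tied_rank
                counts[model] += 1
            i = j + 1
    return {model: rank_sums[model] / counts[model] for model in rank_sums}


# Hypothetical MASE values for three made-up models on three made-up slices.
example = {
    "slice_a": {"model_x": 0.80, "model_y": 0.90, "model_z": 1.10},
    "slice_b": {"model_x": 1.20, "model_y": 0.70, "model_z": 0.95},
    "slice_c": {"model_x": 0.60, "model_y": 0.60, "model_z": 0.75},
}

leaderboard = sorted(average_ranks(example).items(), key=lambda kv: kv[1])
for model, avg_rank in leaderboard:
    print(f"{model}: {avg_rank:.2f}")
```

In this toy data model_x wins two slices outright or by a tie, yet model_y ends up with the lower average rank because it never finishes last. That is the behaviour an average-rank leaderboard is meant to reward.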

GIFT-Eval leaderboard

Showing 94 of 94 models · 37 hosted on TSFM.ai · Sorted by overall MASE rank (lower is better)

# | Model | Organization | Links | MASE rank
1 | Cobra-Agent | Dalpha AI | HF | 14.63
2 | Toto-2.0-FnF | Datadog | HF, Code | 16.45
3 | RAES-Conductance-Ensemble | California State University Northridge | HF, Code | 16.62
4 | Taichu-TimeSeries-Agent | zidongtaichu | HF, Code | 17.58
5 | Toto-2.0-2.5B-FT | Datadog | HF, Code | 18.29
6 | TSOrchestra | Melady Lab @ USC | HF, Code | 18.41
7 | DeOSAlphaTimeGPTPredictor-2025 | vencortex® | HF | 19.03
8 | MoiraiAgent-leaking | Salesforce AI Research | HF, Code | 20.18
9 | RAES-Conductance-Ensemble-V | SETI | HF, Code | 20.34
10 | MoiraiAgent | Salesforce AI Research | HF, Code | 21.90
11 | Credence | ContinualIST | HF, Code | 22.02
12 | Samay | Kairosity | - | 24.41
13 | Toto-2.0-2.5B | Datadog | HF, Code | 25.40
14 | ZooCast-Top1 | - | Code | 25.46
15 | Migas-1.0 | Synthefy | HF | 25.67
16 | STRIDE (+Chronos-2) | Google Cloud AI Research | HF | 25.75
17 | Synapse | Google Cloud AI Research | HF | 26.01
18 | TimeCopilot | - | HF, Code | 26.01
19 | STRIDE (+Timer-S1) | Google Cloud AI Research | HF | 26.41
20 | - | - | - | 26.70
21 | - | - | - | 27.68
22 | - | - | - | 28.33
23 | Prism | Birla AI Labs | - | 29.68
24 | TurkForecast-FM-Chronos2-LoRA-v1 | TurkForecast (Mert Karatay) | HF, Code | 30.09
25 | TSOrchestra-test | Melady Lab @ USC | HF | 30.18
26 | - | IBM TSFM & Rensselaer Polytechnic Institute | - | 31.36
27 | Granite-FlowState-r1.1 | IBM TSFM | HF, Code | 31.38
28 | - | Tsinghua & ByteDance | - | 31.56
29 | - | Google Research | - | 32.14
30 | - | - | - | 32.22
31 | LongSeer-v1.0 | LongShine AI Research | - | 33.41
32 | - | - | - | 35.26
33 | Falcon-X | ant-intl | HF, Code | 35.27
34 | - | IBM TSFM & Rensselaer Polytechnic Institute | - | 36.06
35 | Reverso | MIT | - | 36.13
36 | - | - | - | 36.39
37 | Xihe-ultra | Ant | HF | 37.19
38 | - | - | - | 37.47
39 | - | - | - | 37.78
40 | VISIT-2.0 | - | - | 38.56
41 | Xihe-max | Ant | HF | 39.91
42 | - | - | - | 41.21
43 | Reverso-Small | MIT | HF, Code | 41.42
44 | FlowState-9.1M | IBM Research | HF, Code | 41.60
45 | TEMPO_ENSEMBLE | Melady Lab @ USC | HF | 42.39
46 | A | ShanghaiTech University | - | 43.10
47 | - | Salesforce AI Research | - | 44.79
48 | - | - | - | 46.44
49 | A | ShanghaiTech University | - | 46.80
50 | A | ShanghaiTech University | - | 47.30
51 | - | - | - | 47.64
52 | - | - | - | 47.85
53 | - | - | - | 47.96
54 | - | Google Research | - | 49.49
55 | - | Tsinghua University | - | 51.08
56 | xLSTM-Mixer | AIML Lab @ TU Darmstadt | HF | 52.87
57 | Reverso-Nano | MIT | - | 53.15
58 | CleanTS-65M | Shandong University | HF, Code | 53.38
59 | - | - | - | 53.97
60 | - | - | - | 55.31
61 | - | - | - | 56.09
62 | TempoPFN | University of Freiburg | HF, Code | 56.40
63 | TabPFN-TS | PriorLabs | HF, Code | 56.51
64 | - | - | - | 58.95
65 | - | Salesforce AI Research | - | 60.65
66 | Lingjiang | Alibaba Cloud | - | 60.92
67 | - | Salesforce AI Research | - | 61.34
68 | - | - | - | 63.31
69 | Chronos_base | AWS AI Labs | HF, Code | 63.73
70 | - | Princeton University | - | 66.22
71 | FLAIR | Mellon Inc. | HF, Code | 66.37
72 | Chronos_small | AWS AI Labs | HF, Code | 66.77
73 | i_transformer | Tsinghua University | - | 67.88
74 | - | - | - | 68.30
75 | Super-Linear | Ben-Gurion University of the Negev | HF, Code | 68.81
76 | - | Google Research | - | 70.32
77 | TFT | Google Research | - | 70.71
78 | VisionTS | Zhejiang University | HF | 72.09
79 | - | Salesforce AI Research | - | 72.22
80 | N-BEATS | ServiceNow | - | 73.16
81 | - | IBM Research | - | 73.55
82 | Auto_Arima | - | - | 74.96
83 | - | - | - | 75.94
84 | - | - | - | 77.85
85 | Auto_ETS | - | - | 78.36
86 | Auto_Theta | - | - | 78.58
87 | Seasonal_Naive | - | HF, Code | 78.63
88 | DLinear | The Chinese University of Hong Kong | - | 79.48
89 | Crossformer | Shanghai Jiao Tong University | - | 79.87
90 | DeepAR | Amazon Research | - | 80.73
91 | TIDE | Google Research | - | 80.76
92 | - | Morgan Stanley & Service Now | - | 83.85
93 | Naive | - | HF, Code | 84.22
94 | VISIT-1.0 | - | - | 84.39
94 models · Aggregated live from the official GIFT-Eval leaderboard. Refreshes every 12 hours. Last refreshed Jun 13, 2026, 6:20 PM

Model landscape

GIFT-Eval is not only a leaderboard for time series foundation models (TSFMs). It also includes classical statistical baselines (ARIMA, ETS, Theta), deep-learning architectures (PatchTST, iTransformer, TFT), and agentic systems, so foundation-model results can be compared against the full prior art.

  • Pretrained: 26 (28%)
  • Zero-shot: 32 (34%)
  • Fine-tuned: 6 (6%)
  • Agentic: 14 (15%)
  • Deep learning: 10 (11%)
  • Statistical: 6 (6%)

Why robustness across datasets matters

A model that tops one leaderboard can collapse on a different frequency or domain. GIFT-Eval forces models to prove themselves across 23 grouped dataset slices covering different frequencies, horizons, and series shapes. If a model ranks well here, you can be more confident it will not surprise you when your data does not look like the training distribution.

Understanding Weighted Quantile Loss

WQL scores the whole forecast distribution, so it penalizes intervals that are too narrow (overconfident) as well as intervals that are too wide (underconfident). A model with a low WQL tends to produce well-calibrated forecast distributions: its 90th-percentile prediction lands above the true value about 90% of the time. This matters for capacity planning, inventory, and any decision that depends on reliable uncertainty estimates rather than just the median forecast.
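
To make the penalty concrete, here is a minimal sketch of one common definition of Weighted Quantile Loss: the pinball loss at each quantile level, summed over time, doubled, divided by the sum of absolute actuals, then averaged over quantile levels. The exact normalization and quantile grid used by the benchmark are assumptions here, and the function names and toy numbers are illustrative only.

```python
def pinball_loss(y_true, y_pred, q):
    """Quantile (pinball) loss for one observation at quantile level q."""
    diff = y_true - y_pred
    return max(q * diff, (q - 1) * diff)


def weighted_quantile_loss(y_true, quantile_forecasts):
    """y_true: list of actuals.
    quantile_forecasts: {quantile_level: list of forecasts, same length as y_true}.

    Returns the mean over quantile levels of 2 * sum(pinball) / sum(|y_true|).
    """
    scale = sum(abs(y) for y in y_true)
    if scale == 0:
        raise ValueError("sum of |y_true| is zero, so WQL is undefined")
    per_quantile = []
    for q, preds in quantile_forecasts.items():
        total = sum(pinball_loss(y, p, q) for y, p in zip(y_true, preds))
        per_quantile.append(2 * total / scale)
    return sum(per_quantile) / len(per_quantile)


# Toy actuals and three made-up forecasts that share the same median.
y = [10.0, 12.0, 9.0, 11.0]
median = [10.0, 11.0, 10.0, 11.0]

reasonable = {0.1: [8.0, 10.0, 8.0, 9.0], 0.5: median, 0.9: [12.0, 14.0, 11.0, 13.0]}
collapsed = {0.1: median, 0.5: median, 0.9: median}    # overconfident: zero-width interval
too_wide = {0.1: [0.0] * 4, 0.5: median, 0.9: [20.0] * 4}  # underconfident: very wide interval

for name, forecast in [("reasonable", reasonable), ("collapsed", collapsed), ("too_wide", too_wide)]:
    print(name, round(weighted_quantile_loss(y, forecast), 4))
```

All three forecasts have identical point accuracy at the median, so a point metric cannot separate them. The WQL differs only because of how the 10th and 90th percentile forecasts sit around the actuals.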

How to interpret it

  • Lower Average Rank is better: rank 1 is the best placement on a slice, and the leaderboard averages placements across many slices.
  • WQL matters when forecast calibration and uncertainty quality matter to the business.
  • Use GIFT-Eval to sanity-check whether a model is robust beyond a single narrow domain.

Frequently asked questions

What is GIFT-Eval?
GIFT-Eval (General Time Series Forecasting Model Evaluation) is a probabilistic forecasting benchmark from Salesforce that evaluates time series foundation models across 23 diverse dataset groups spanning 7 domains and 10 frequencies. Models are ranked by Average Rank and Average Weighted Quantile Loss.
What does Average Rank mean in GIFT-Eval?
Average Rank is the mean placement of a model across every benchmark slice. A lower value means the model consistently finishes near the top across heterogeneous datasets rather than dominating only one slice.
When should I use GIFT-Eval instead of FEV Bench?
Use GIFT-Eval when you care about probabilistic forecast quality (calibrated uncertainty) and robustness across many different data domains, rather than just point-forecast accuracy on a single leaderboard.
Does GIFT-Eval test multivariate forecasting?
GIFT-Eval includes both univariate and multivariate slices. The leaderboard on TSFM.ai aggregates both variate types from the upstream grouped-by-univariate file, and you can filter per-slice rankings to inspect multivariate-only performance.
What is the 'test leak' column?
Some GIFT-Eval submissions are from models whose training data overlaps with the evaluation datasets. A 'Yes' in the test-leak column means the authors disclosed partial or full pretraining-data overlap; use the filter to compare only models with no known test-data leakage.
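
The same filter can be applied offline to an exported copy of the table. The sketch below assumes each row is a dict with the hypothetical keys 'model', 'mase_rank', and 'test_leak'; the page's real data schema is not shown here, and the rows are invented. Note that it only re-orders the remaining models by their existing MASE rank and does not recompute ranks within each slice.

```python
def rank_without_leaks(rows):
    """rows: list of dicts with hypothetical keys 'model', 'mase_rank', 'test_leak'.

    Drops rows whose test_leak flag is 'Yes' and returns
    (position, model, mase_rank) tuples ordered by MASE rank.
    """
    clean = [row for row in rows if row["test_leak"] != "Yes"]
    clean.sort(key=lambda row: row["mase_rank"])
    return [(pos, row["model"], row["mase_rank"]) for pos, row in enumerate(clean, start=1)]


# Invented rows for illustration; these are not leaderboard values.
rows = [
    {"model": "model_a", "mase_rank": 15.0, "test_leak": "Yes"},
    {"model": "model_b", "mase_rank": 18.5, "test_leak": "No"},
    {"model": "model_c", "mase_rank": 17.2, "test_leak": "No"},
]

for position, model, mase_rank in rank_without_leaks(rows):
    print(position, model, mase_rank)
```
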
What model types appear on the leaderboard?
GIFT-Eval ranks pretrained foundation models, zero-shot models, fine-tuned models, agentic systems, classical deep-learning architectures (PatchTST, iTransformer, TFT, TiDE, N-BEATS), and statistical baselines (ARIMA, ETS, Theta). The page shows every type so you can benchmark a TSFM against classical prior art.
How often is the leaderboard refreshed?
TSFM.ai refetches the upstream Salesforce GIFT-Eval results every 12 hours via Next.js ISR. The 'last refreshed' timestamp at the bottom of the leaderboard reflects the most recent successful refresh.

Related reading

Compare with other TSFM benchmarks

FEV Bench

How well does a model generalize to unseen real-world forecasting tasks?

BOOM

How do models behave on observability telemetry instead of academic datasets?

Impermanent

Does model performance hold up as real time passes and the data distribution shifts?

ARFBench

Which multimodal models can reason about anomalies, timing, magnitude, and cross-series structure in production telemetry?

Sources