How well does a model generalize zero-shot to unseen time series?
GIFT-Eval Leaderboard
GIFT-Eval expands the question beyond raw point accuracy. It looks at how models perform across many datasets, frequencies, and forecasting settings, making it a strong benchmark for teams that care about robustness rather than a single flattering leaderboard slice. This page mirrors the official Salesforce GIFT-Eval leaderboard with search, filters, and per-slice rankings — covering every submitted model, not just the top few — and links each result back to a hosted endpoint on TSFM.ai where available.
TSOrchestra ranks first with an average MASE rank of 11.72 across the GIFT-Eval dataset slices.
What this benchmark answers
Which models stay strong across heterogeneous datasets and probabilistic settings?
Methodology
Models are scored on grouped benchmark slices and ranked by average rank, with Weighted Quantile Loss providing a secondary read on probabilistic accuracy.
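A minimal sketch of that aggregation, assuming per-slice MASE scores are already available in a table (the model names, slice names, and scores below are illustrative placeholders, not real leaderboard data):

```python
import pandas as pd

# Hypothetical per-slice MASE scores: one row per model, one column
# per GIFT-Eval dataset slice (lower MASE is better).
mase = pd.DataFrame(
    {
        "slice_hourly": [0.81, 0.94, 1.10],
        "slice_daily": [0.72, 0.69, 0.88],
        "slice_weekly": [1.35, 1.20, 1.42],
    },
    index=["model_a", "model_b", "model_c"],
)

# Rank models within each slice (1 = best MASE on that slice), then
# average those placements to get the leaderboard's overall rank score.
per_slice_rank = mase.rank(axis=0, method="average")
average_rank = per_slice_rank.mean(axis=1).sort_values()
print(average_rank)  # lowest average rank first: model_b, model_a, model_c
```

Ranking within each slice before averaging is what makes the score robust: a model cannot buy a good overall position by winning one slice by a huge margin.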
GIFT-Eval leaderboard
Showing 77 of 77 models · 34 hosted on TSFM.ai · Sorted by overall MASE rank (lower is better)
| # | Model | MASE rank |
|---|---|---|
| 1 | TSOrchestra | 11.72 |
| 2 | DeOSAlphaTimeGPTPredictor-2025HF vencortex® | 12.03 |
| 3 | | 13.00 |
| 4 | | 13.90 |
| 5 | | 14.07 |
| 6 | Samay Kairosity | 16.11 |
| 7 | ZooCast-Top1Code | 16.38 |
| 8 | | 17.04 |
| 9 | Migas-1.0HF Synthefy | 17.09 |
| 10 | SynapseHF Google Cloud AI Research | 17.13 |
| 11 | | 18.84 |
| 12 | TSOrchestra-testHF Melady Lab @ USC | 20.38 |
| 13 | IBM TSFM & Rensselaer Polytechnic Institute | 21.70 |
| 14 | | 21.81 |
| 15 | IBM TSFM | 21.93 |
| 16 | Google Research | 22.06 |
| 17 | LongSeer-v1.0 LongShine AI Research | 22.94 |
| 18 | | 24.06 |
| 19 | Reverso MIT | 25.03 |
| 20 | IBM TSFM & Rensselaer Polytechnic Institute | 25.09 |
| 21 | Xihe-ultraHF Ant | 25.54 |
| 22 | | 26.18 |
| 23 | | 26.21 |
| 24 | VISIT-2.0 | 27.16 |
| 25 | Xihe-maxHF Ant | 28.00 |
| 26 | | 29.08 |
| 27 | | 29.18 |
| 28 | | 29.38 |
| 29 | A ShanghaiTech University | 30.93 |
| 30 | IBM Research | 31.38 |
| 31 | | 31.81 |
| 32 | TEMPO_ENSEMBLEHF Melady Lab @ USC | 33.28 |
| 33 | Datadog | 33.37 |
| 34 | A ShanghaiTech University | 33.70 |
| 35 | A ShanghaiTech University | 34.13 |
| 36 | IBM Research | 34.77 |
| 37 | AWS AI Labs | 35.00 |
| 38 | Google Research | 36.89 |
| 39 | Tsinghua University | 37.71 |
| 40 | Reverso-Nano MIT | 38.85 |
| 41 | | 39.33 |
| 42 | xLSTM-MixerHF AIML Lab @ TU Darmstadt | 39.44 |
| 43 | AWS AI Labs | 40.23 |
| 44 | Alibaba | 40.86 |
| 45 | Alibaba | 41.71 |
| 46 | | 41.97 |
| 47 | | 42.42 |
| 48 | Alibaba | 44.22 |
| 49 | Salesforce AI Research | 46.36 |
| 50 | Salesforce AI Research | 46.96 |
| 51 | Lingjiang Alibaba Cloud | 47.61 |
| 52 | AWS AI Labs | 48.66 |
| 53 | | 48.88 |
| 54 | Princeton University | 50.99 |
| 55 | | 51.62 |
| 56 | i_transformer Tsinghua University | 52.45 |
| 57 | Alibaba | 52.49 |
| 58 | | 54.90 |
| 59 | TFT Google Research | 54.98 |
| 60 | VisionTSHF Zhejiang University | 55.91 |
| 61 | Salesforce AI Research | 56.43 |
| 62 | N-BEATS ServiceNow | 57.16 |
| 63 | | 57.63 |
| 64 | FLAIR Mellon Inc. | 58.57 |
| 65 | Auto_Arima | 58.90 |
| 66 | IBM Research | 59.75 |
| 67 | IBM Research | 61.75 |
| 68 | | 62.18 |
| 69 | Auto_ETS | 62.48 |
| 70 | Auto_Theta | 62.75 |
| 71 | DLinear The Chinese University of Hong Kong | 62.92 |
| 72 | TIDE Google Research | 64.15 |
| 73 | DeepAR Amazon Research | 64.30 |
| 74 | Crossformer Shanghai Jiao Tong University | 64.60 |
| 75 | | 67.07 |
| 76 | | 67.66 |
| 77 | VISIT-1.0 | 67.84 |
Model landscape
GIFT-Eval is not only a TSFM leaderboard — it includes classical statistical baselines (ARIMA, ETS, Theta), deep-learning architectures (PatchTST, iTransformer, TFT), and agentic systems, so foundation-model results can be compared against the full prior art.
- Pretrained: 17 (22%)
- Zero-shot: 32 (42%)
- Fine-tuned: 4 (5%)
- Agentic: 8 (10%)
- Deep learning: 10 (13%)
- Statistical: 6 (8%)
Why robustness across datasets matters
A model that tops one leaderboard can collapse on a different frequency or domain. GIFT-Eval forces models to prove themselves across 23 grouped dataset slices covering different frequencies, horizons, and series shapes. If a model ranks well here, you can be more confident it will not surprise you when your data does not look like the training distribution.
Understanding Weighted Quantile Loss
WQL penalizes both overconfident and underconfident prediction intervals. A model with a low WQL produces forecast distributions that are well-calibrated — the 90th percentile prediction actually lands above the true value about 90% of the time. This matters for capacity planning, inventory, and any decision that depends on reliable uncertainty estimates rather than just the median forecast.
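As a concrete reference, here is a hedged sketch of how a WQL-style metric is commonly computed in the probabilistic-forecasting literature: a pinball loss per quantile level, summed and normalized by the scale of the target. It illustrates the idea rather than reproducing the exact upstream evaluation code.

```python
import numpy as np

def weighted_quantile_loss(y_true, y_pred_quantiles, quantiles):
    """Mean pinball loss across quantile levels, scaled by 2 and
    normalized by the total absolute value of the target series.

    y_true: shape (T,) actual values
    y_pred_quantiles: shape (Q, T), one row of predictions per level
    quantiles: shape (Q,) quantile levels, e.g. [0.1, 0.5, 0.9]
    """
    y_true = np.asarray(y_true, dtype=float)
    total = 0.0
    for q, y_q in zip(quantiles, np.asarray(y_pred_quantiles, dtype=float)):
        diff = y_true - y_q
        # Pinball loss: under-prediction is penalized with weight q,
        # over-prediction with weight (1 - q).
        total += np.sum(np.maximum(q * diff, (q - 1) * diff))
    return 2 * total / (len(quantiles) * np.sum(np.abs(y_true)))

# Toy example: three quantile forecasts over a four-step horizon.
y = [10.0, 12.0, 9.0, 11.0]
preds = [
    [8.0, 10.0, 7.5, 9.0],     # q = 0.1
    [10.0, 12.5, 9.0, 11.0],   # q = 0.5
    [12.0, 14.0, 11.0, 13.0],  # q = 0.9
]
print(weighted_quantile_loss(y, preds, [0.1, 0.5, 0.9]))
```

The asymmetric weighting is what rewards calibration: a model keeps the metric low only by placing its quantile bands tightly around the values that actually occur.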
How to interpret it
- Lower Average Rank is better because the leaderboard aggregates placement across many slices.
- WQL matters when forecast calibration and uncertainty quality matter to the business.
- Use GIFT-Eval to sanity-check whether a model is robust beyond a single narrow domain.
Frequently asked questions
- What is GIFT-Eval?
- GIFT-Eval (General Time Series Forecasting Model Evaluation) is a probabilistic forecasting benchmark from Salesforce that evaluates time series foundation models across 23 diverse dataset groups spanning 7 domains and 10 frequencies. Models are ranked by Average Rank and Average Weighted Quantile Loss.
- What does Average Rank mean in GIFT-Eval?
- Average Rank is the mean placement of a model across every benchmark slice. For example, a model that finishes 3rd, 7th, and 2nd on three slices has an Average Rank of 4.0. A lower value means the model consistently finishes near the top across heterogeneous datasets rather than dominating only one slice.
- When should I use GIFT-Eval instead of FEV Bench?
- Use GIFT-Eval when you care about probabilistic forecast quality (calibrated uncertainty) and robustness across many different data domains, rather than just point-forecast accuracy on a single leaderboard.
- Does GIFT-Eval test multivariate forecasting?
- GIFT-Eval includes both univariate and multivariate slices. The leaderboard on TSFM.ai aggregates both variate types from the upstream grouped-by-univariate file, and you can filter per-slice rankings to inspect multivariate-only performance.
- What is the 'test leak' column?
- Some GIFT-Eval submissions are from models whose training data overlaps with the evaluation datasets. A 'Yes' in the test-leak column means the authors disclosed partial or full pretraining-data overlap; use the filter to compare only models with no known test-data leakage.
- What model types appear on the leaderboard?
- GIFT-Eval ranks pretrained foundation models, zero-shot models, fine-tuned models, agentic systems, classical deep-learning architectures (PatchTST, iTransformer, TFT, TiDE, N-BEATS), and statistical baselines (ARIMA, ETS, Theta). The page shows every type so you can benchmark a TSFM against classical prior art.
- How often is the leaderboard refreshed?
- TSFM.ai refetches the upstream Salesforce GIFT-Eval results every 12 hours via Next.js ISR. The 'last refreshed' timestamp at the bottom of the leaderboard reflects the most recent successful refresh.
Related reading
Compare with other TSFM benchmarks
How do models behave on observability telemetry instead of academic datasets?
Does model performance hold up as real time passes and the data distribution shifts?