What is the Impermanent benchmark?

Impermanent is a live benchmark for time series foundation models that tests temporal generalization. It uses a prequential protocol on continuously updating GitHub activity data, scoring models on forecasts they made before outcomes existed.

Why is Impermanent listed as an emerging benchmark?

The benchmark methodology and dataset are established, but the public machine-readable leaderboard feed is still stabilizing. TSFM.ai will publish automated rankings once the upstream source exposes a durable feed.

What is temporal contamination?

Temporal contamination occurs when benchmark test data already exists at model training time, allowing models to overfit, intentionally or not, to the specific test split. Impermanent avoids this because the observations being forecast do not yet exist when the forecasts are made.

What data does Impermanent use?

Impermanent uses GitHub activity data from 400 repositories across 4 event types (issues, pull requests, pushes, stargazers) at 4 frequencies, producing 6,400 total time series.

Live · 12 models · 13 cutoffs · 24 slices · Window 2026-01-18-00 → 2026-04-12-00

Impermanent Benchmark

Name: Impermanent Benchmark (TSFM.ai snapshot)
Creator: TimeCopilot

Impermanent is designed to test whether benchmark wins survive once real time passes. It scores forecasts sequentially on a continuously updating GitHub activity stream, so the benchmark reflects temporal drift and the fact that future observations do not exist at training time. This page scrapes the live Impermanent dashboard on every refresh so you can see how each model's rank moves across the prequential window, including drift per subdataset, per frequency, and per sparsity bucket.

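Chronos tops the leaderboard with an average MASE rank of 3.94 and an average CRPS rank of 2.76 across the prequential window. TiRex has a slightly better MASE rank (3.85) but a worse CRPS rank (3.19).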

What this benchmark answers

Does model performance hold up as real time passes and the data distribution shifts?

Methodology

The benchmark uses a prequential protocol: models forecast before outcomes exist, scores accumulate over time, and rankings reflect sustained performance under live temporal change rather than one-off wins on a frozen split.

Prequential leaderboard

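Models ranked across 13 weekly cutoffs from Jan 18 to Apr 12, 2026. Lower is better in every metric column. The row order does not follow MASE rank alone (TiRex has the lowest MASE rank); it is consistent with ordering by the sum of the MASE rank and CRPS rank columns.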

#	Model	MASE rank	CRPS rank	Scaled MASE	Scaled CRPS	Hosted
🥇	Chronos	3.937	2.758	0.391	1.024	View family
🥈	TiRex	3.849	3.187	0.392	1.070	View model
🥉	ZeroModel	5.516	1.893	0.521	0.524	—
4	Moirai	4.230	3.345	0.411	1.262	View family
5	TimesFM	4.067	3.921	0.421	1.112	View family
6	AutoCES	6.226	7.159	0.666	5.357	—
7	DynamicOptimizedTheta	6.163	7.750	0.687	6.288	—
8	SeasonalNaive	4.413	11.56	0.456	10.17	—
9	AutoETS	8.103	8.020	0.832	5.955	—
10	AutoARIMA	9.171	8.194	0.983	6.157	—
11	Prophet	10.47	8.984	1.635	6.433	—
12	HistoricAverage	11.85	11.23	2.790	7.361	—
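
The ordering of the rows above can be checked directly from the two rank columns. The sketch below is only a consistency check on the published numbers: it assumes nothing about how the upstream dashboard actually computes its ordering, and the variable names are illustrative.

```python
# (model, MASE rank, CRPS rank), copied in the order the rows appear above.
leaderboard = [
    ("Chronos", 3.937, 2.758),
    ("TiRex", 3.849, 3.187),
    ("ZeroModel", 5.516, 1.893),
    ("Moirai", 4.230, 3.345),
    ("TimesFM", 4.067, 3.921),
    ("AutoCES", 6.226, 7.159),
    ("DynamicOptimizedTheta", 6.163, 7.750),
    ("SeasonalNaive", 4.413, 11.56),
    ("AutoETS", 8.103, 8.020),
    ("AutoARIMA", 9.171, 8.194),
    ("Prophet", 10.47, 8.984),
    ("HistoricAverage", 11.85, 11.23),
]

page_order = [name for name, _, _ in leaderboard]

# Candidate orderings: by MASE rank alone, and by the sum of both rank columns.
by_mase = [row[0] for row in sorted(leaderboard, key=lambda r: r[1])]
by_combined = [row[0] for row in sorted(leaderboard, key=lambda r: r[1] + r[2])]

print("matches MASE-only order:", by_mase == page_order)
print("matches MASE+CRPS order:", by_combined == page_order)
```

Sorting by MASE rank alone puts TiRex first and moves SeasonalNaive up to fifth, so a reader who cares mainly about point-forecast accuracy may want to re-sort the table on that column.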

Per-slice performance over time

On the live dashboard, a metric, subdataset, frequency, and sparsity bucket can be selected to show how each model scored at every prequential cutoff. Lower is better for both metrics (MASE and CRPS). The table below is a snapshot of a single slice.

Model	Jan 18	Jan 25	Feb 1	Feb 8	Feb 15	Feb 22	Mar 1	Mar 8	Mar 15	Mar 22	Mar 29	Apr 5	Apr 12
ZeroModel	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000
Chronos	0.914	0.877	0.954	0.998	1.118	0.930	0.951	1.080	1.149	1.111	0.963	1.235	1.123
TiRex	0.990	0.897	0.990	1.011	1.167	0.902	1.041	1.069	1.221	1.104	0.935	1.155	1.141
TimesFM	0.959	0.919	1.012	1.059	1.210	0.937	1.049	1.200	1.248	1.269	0.985	1.195	1.152
SeasonalNaive	1.084	1.025	0.977	1.132	1.310	1.150	1.136	1.220	1.084	1.302	1.186	1.288	1.153
Moirai	1.009	0.875	0.968	1.079	1.224	0.988	1.026	1.111	1.246	1.224	0.987	1.189	1.243
DynamicOptimizedTheta	1.077	1.081	1.305	1.174	1.320	1.091	1.156	1.400	1.600	1.733	1.346	2.071	1.574
AutoCES	1.105	1.082	1.188	1.173	1.316	1.117	1.226	1.367	1.634	1.657	1.427	2.105	1.634
AutoETS	1.113	1.107	1.273	1.269	1.352	1.106	1.207	1.395	1.608	1.876	1.522	2.187	1.742
AutoARIMA	1.106	1.137	1.286	1.292	1.361	1.126	1.224	1.384	1.653	1.756	1.574	2.099	1.759
Prophet	1.259	1.199	1.358	1.356	1.586	1.386	1.368	1.626	1.958	2.207	1.970	2.761	2.387
HistoricAverage	2.354	2.165	2.677	2.514	3.224	2.455	2.659	3.555	4.356	5.332	4.527	7.727	6.041

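The table covers 13 prequential cutoffs for the Issues opened · Daily · Low sparsity slice. Values are scaled MASE; lower is better. The page describes them as scaled against the seasonal naive baseline, although ZeroModel reads exactly 1.000 at every cutoff while SeasonalNaive does not, which suggests this slice is normalized to ZeroModel instead.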
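
One way to read drift out of the slice table is to compare a model's average score over the last few cutoffs with its average over the first few. The sketch below does this for three rows copied from the table; the `drift` helper and the choice of a four-cutoff window are illustrative assumptions, not a metric defined by Impermanent.

```python
from statistics import mean

# Scaled MASE per cutoff (Jan 18 ... Apr 12), copied from the slice table above.
scores = {
    "Chronos": [0.914, 0.877, 0.954, 0.998, 1.118, 0.930, 0.951,
                1.080, 1.149, 1.111, 0.963, 1.235, 1.123],
    "SeasonalNaive": [1.084, 1.025, 0.977, 1.132, 1.310, 1.150, 1.136,
                      1.220, 1.084, 1.302, 1.186, 1.288, 1.153],
    "HistoricAverage": [2.354, 2.165, 2.677, 2.514, 3.224, 2.455, 2.659,
                        3.555, 4.356, 5.332, 4.527, 7.727, 6.041],
}

def drift(values, k=4):
    """Mean of the last k cutoffs minus mean of the first k (hypothetical helper)."""
    return mean(values[-k:]) - mean(values[:k])

# Positive drift means the model scored worse late in the window than early on.
for model, values in scores.items():
    print(f"{model}: {drift(values):+.3f}")
```

On these rows, Chronos and SeasonalNaive drift by a similar small amount, while HistoricAverage degrades far more over the same window.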

12 models · 4 subdatasets · 2 frequencies · 3 sparsity buckets · Scraped live from impermanent.timecopilot.dev. Refreshes every 12 hours. Last refreshed Apr 29, 2026, 4:59 PM

The temporal contamination problem

Static benchmarks freeze a test split at one point in time. Once that split is public, model authors can overfit to it, intentionally or not, and benchmark scores stop reflecting real-world performance. Impermanent sidesteps this by collecting forecasts before outcomes exist: the data being forecast has not been observed when the forecast is made, so there is no split to leak.

How the prequential protocol works

Models submit forecasts on a rolling basis against a continuously updating stream of GitHub activity data (issues opened, PRs merged, pushes, stargazers) across 400 repositories and 4 frequencies. Scores accumulate over time, so a model that performs well early but degrades under distribution shift will see its aggregate ranking fall. This makes Impermanent a measure of sustained forecasting ability, not one-off performance.
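
The rolling procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Impermanent's implementation: the stream is synthetic rather than GitHub data, the two baselines are simplified stand-ins for the SeasonalNaive and ZeroModel rows, the season length of 7 assumes daily data with a weekly cycle, and the MASE here is the textbook definition (forecast MAE divided by in-sample seasonal naive MAE), which may differ from the scaling the dashboard uses.

```python
from statistics import mean

def seasonal_naive(history, horizon, season=7):
    """Repeat the last full season of observations."""
    last = history[-season:]
    return [last[i % season] for i in range(horizon)]

def zero_forecast(history, horizon):
    """Always predict zero activity."""
    return [0.0] * horizon

def mase(actual, forecast, history, season=7):
    """Forecast MAE divided by the in-sample seasonal naive MAE.

    Assumes the history holds more than one full season.
    """
    mae = mean(abs(a - f) for a, f in zip(actual, forecast))
    scale = mean(abs(history[t] - history[t - season])
                 for t in range(season, len(history)))
    return mae / scale if scale > 0 else float("nan")

def prequential_scores(stream, models, first_cutoff, horizon, step):
    """Forecast at each cutoff from past data only, then score against outcomes."""
    scores = {name: [] for name in models}
    cutoff = first_cutoff
    while cutoff + horizon <= len(stream):
        history = stream[:cutoff]                  # everything that exists at forecast time
        actual = stream[cutoff:cutoff + horizon]   # observed only after forecasts are fixed
        for name, model in models.items():
            forecast = model(history, horizon)
            scores[name].append(mase(actual, forecast, history))
        cutoff += step
    return scores

# Synthetic daily series: a weekly pattern whose level rises by 1 each week.
stream = [(t % 7) + 1 + t // 7 for t in range(56)]
models = {"SeasonalNaive": seasonal_naive, "ZeroModel": zero_forecast}
scores = prequential_scores(stream, models, first_cutoff=28, horizon=7, step=7)

for name, values in scores.items():
    print(name, values, "mean:", mean(values))
```

Because the synthetic level keeps rising, the zero forecast gets worse at every cutoff while the seasonal naive error stays flat, which is the kind of degradation under drift that accumulated prequential scores are meant to expose.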

How to interpret it

- Impermanent is the right benchmark when you care about temporal drift, not just a frozen snapshot.
- A strong live benchmark should tell you whether early leaderboard wins persist under new data.
- Because the public interface is still settling, treat the methodology as the durable asset and the rows as an upcoming integration.

