Live · 18 models · 13 cutoffs · 24 slices · Window 2026-03-01-00 to 2026-05-24-00

Impermanent Benchmark

Impermanent is designed to test whether benchmark wins survive once real time passes. It scores forecasts sequentially on a continuously updating GitHub activity stream, so the benchmark reflects temporal drift and the fact that future observations do not exist at training time. This page scrapes the live Impermanent dashboard on every refresh so you can see how each model’s rank moves across the prequential window — including per-subdataset, per-frequency, and per-sparsity drift.

ZeroModel leads with an average MASE rank of 6.56 across the prequential window.

What this benchmark answers

Does model performance hold up as real time passes and the data distribution shifts?

Methodology

The benchmark uses a prequential protocol: models forecast before outcomes exist, scores accumulate over time, and rankings reflect sustained performance under live temporal change rather than one-off wins on a frozen split.
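
The loop described above can be sketched in a few lines. This is a minimal illustration, not Impermanent's actual pipeline: the function names, the toy series, the last-value forecaster, and the use of mean absolute error as the score are all assumptions made for the example.

```python
from statistics import mean

def prequential_scores(series, first_cutoff, step, horizon, forecaster):
    """Score a forecaster at successive cutoffs, using only data before each cutoff."""
    scores = []
    cutoff = first_cutoff
    while cutoff + horizon <= len(series):
        history = series[:cutoff]                  # everything the model is allowed to see
        forecast = forecaster(history, horizon)    # issued before the outcomes are revealed
        actual = series[cutoff:cutoff + horizon]   # revealed only after the forecast is made
        scores.append(mean([abs(f - a) for f, a in zip(forecast, actual)]))
        cutoff += step                             # move to the next cutoff
    return scores

def last_value(history, horizon):
    """Hypothetical baseline: repeat the last observed value."""
    return [history[-1]] * horizon

# Toy series standing in for a live activity stream (made-up numbers).
series = [3, 4, 4, 5, 7, 6, 8, 9, 9, 11, 12, 12]
scores = prequential_scores(series, first_cutoff=4, step=2, horizon=2, forecaster=last_value)

# Scores accumulate over time: the running mean after each cutoff.
running = [mean(scores[: i + 1]) for i in range(len(scores))]
print(scores)
print(running)
```

Each forecast is produced from `series[:cutoff]` alone, which is the property that separates this protocol from a frozen train/test split.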

Prequential leaderboard

Models ranked by their sustained MASE rank across 13 weekly cutoffs from Mar 1 to May 24.

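| # | Model | MASE rank | CRPS rank |
|---|---|---|---|
| 🥇 | ZeroModel | 6.558 | 2.413 |
| 🥈 | | 5.675 | 4.650 |
| 🥉 | | 5.325 | 5.429 |
| 4 | AutoNBEATS | 5.572 | 5.294 |
| 5 | AutoNHITS | 6.000 | 5.139 |
| 6 | AutoTFT | 6.006 | 5.806 |
| 7 | AutoPatchTST | 7.361 | 4.533 |
| 8 | | 6.067 | 6.467 |
| 9 | | 5.963 | 6.908 |
| 10 | | 9.378 | 9.856 |
| 11 | AutoCES | 9.321 | 11.48 |
| 12 | DynamicOptimizedTheta | 9.346 | 12.36 |
| 13 | SeasonalNaive | 5.792 | 16.03 |
| 14 | Toto | 16.27 | 6.278 |
| 15 | AutoETS | 12.07 | 12.54 |
| 16 | AutoARIMA | 12.80 | 12.66 |
| 17 | Prophet | 14.72 | 13.44 |
| 18 | HistoricAverage | 16.18 | 15.70 |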
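
The page does not spell out how the rank columns are aggregated. As a rough sketch, assuming a model's score is simply the mean of its per-cutoff ranks (with ties and any weighting across slices ignored, both assumptions), the calculation could look like the following. The inputs are the Mar 1, Mar 8, and Mar 15 values for three models from the Issues opened · Daily · Low sparsity slice on this page; `mean_ranks` is a hypothetical helper name.

```python
def mean_ranks(scores_by_model):
    """Rank models at each cutoff (1 = lowest score), then average the ranks."""
    models = list(scores_by_model)
    n_cutoffs = len(next(iter(scores_by_model.values())))
    totals = {m: 0 for m in models}
    for t in range(n_cutoffs):
        # Lower score is better, so sort ascending; ties are not handled here.
        ordered = sorted(models, key=lambda m: scores_by_model[m][t])
        for rank, m in enumerate(ordered, start=1):
            totals[m] += rank
    return {m: totals[m] / n_cutoffs for m in models}

# Scaled MASE at the first three cutoffs of one slice.
slice_scores = {
    "ZeroModel": [1.000, 1.000, 1.000],
    "TiRex": [1.041, 1.069, 1.221],
    "Chronos": [0.951, 1.080, 1.149],
}

ranks = mean_ranks(slice_scores)
for model, value in sorted(ranks.items(), key=lambda kv: kv[1]):
    print(f"{model}: {value:.3f}")
```

With only three models and three cutoffs the resulting numbers are not comparable to the leaderboard, which covers 18 models and 13 cutoffs.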

Per-slice performance over time

Pick a subdataset, frequency, and sparsity bucket to see how each model scored at every prequential cutoff. Lower is better for both metrics.

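| Model | Mar 1 | Mar 8 | Mar 15 | Mar 22 | Mar 29 | Apr 5 | Apr 12 | Apr 19 | Apr 26 | May 3 | May 10 | May 17 | May 24 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ZeroModel | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.880 | 0.259 |
| TiRex | 1.041 | 1.069 | 1.221 | 1.104 | 0.935 | 1.155 | 1.141 | 1.248 | 1.152 | 1.286 | 1.304 | 1.005 | 0.798 |
| Moirai | 1.026 | 1.111 | 1.246 | 1.224 | 0.987 | 1.189 | 1.243 | 1.203 | 1.066 | 1.304 | 1.453 | 1.116 | 0.874 |
| TimesFM | 1.049 | 1.200 | 1.248 | 1.269 | 0.985 | 1.195 | 1.152 | 1.294 | 1.204 | 1.333 | 1.417 | 1.114 | 0.891 |
| Chronos | 0.951 | 1.080 | 1.149 | 1.111 | 0.963 | 1.235 | 1.123 | 1.254 | 1.236 | 1.313 | 1.350 | 1.076 | 0.970 |
| SeasonalNaive | 1.136 | 1.220 | 1.084 | 1.302 | 1.186 | 1.288 | 1.153 | 1.276 | 1.521 | 1.363 | 1.698 | 1.170 | 1.283 |
| DynamicOptimizedTheta | 1.156 | 1.400 | 1.600 | 1.733 | 1.346 | 2.071 | 1.574 | 1.783 | 1.844 | 2.161 | 2.424 | 2.078 | 1.765 |
| AutoETS | 1.207 | 1.395 | 1.608 | 1.876 | 1.522 | 2.187 | 1.742 | 1.962 | 1.933 | 2.307 | 2.385 | 2.276 | 1.940 |
| AutoCES | 1.226 | 1.367 | 1.634 | 1.657 | 1.427 | 2.105 | 1.634 | 1.868 | 1.961 | 2.183 | 2.404 | 2.247 | 1.957 |
| AutoARIMA | 1.224 | 1.384 | 1.653 | 1.756 | 1.574 | 2.099 | 1.759 | 1.912 | 2.011 | 2.334 | 2.408 | 2.321 | 2.042 |
| Prophet | 1.368 | 1.626 | 1.958 | 2.207 | 1.970 | 2.761 | 2.387 | 2.888 | 2.962 | 3.616 | 4.035 | 3.999 | 3.460 |
| HistoricAverage | 2.659 | 3.555 | 4.356 | 5.332 | 4.527 | 7.727 | 6.041 | 7.513 | 7.851 | 10.25 | 11.08 | 12.77 | 10.96 |
| AutoNBEATS | 0.964 | 1.080 | 1.097 | 1.057 | 0.954 | 1.272 | 1.137 | 1.171 | | | | | |
| AutoNHITS | 0.985 | 0.998 | 1.037 | 1.051 | 0.985 | 1.298 | 1.110 | 1.110 | | | | | |
| AutoPatchTST | 1.232 | 1.106 | 1.180 | 1.317 | 1.124 | 1.430 | 1.208 | 1.063 | | | | | |
| AutoTFT | 0.923 | 0.998 | 1.057 | 0.999 | 0.931 | 1.030 | 1.104 | 1.027 | | | | | |
| FlowState | 1.137 | 1.270 | 1.349 | 1.441 | 1.199 | 1.677 | 1.363 | 1.571 | | | | | |
| Toto | 2.135 | 2.633 | 3.235 | 3.397 | 2.898 | 5.143 | 3.532 | 3.763 | | | | | |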

13 prequential cutoffs for Issues opened · Daily · Low sparsity. Values are scaled MASE against the seasonal naive baseline; lower is better.
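
For readers unfamiliar with the metric, here is a small sketch of the textbook MASE: the forecast's mean absolute error divided by the in-sample mean absolute error of a seasonal naive forecast. The exact scaling Impermanent applies is not documented on this page, so the season length of 7 for daily data, the function name, and the toy numbers are all assumptions.

```python
def mase(history, actual, forecast, season=7):
    """Mean absolute scaled error relative to an in-sample seasonal naive forecast."""
    if len(history) <= season:
        raise ValueError("history must be longer than one season")
    # In-sample seasonal naive error: compare each point with the one a season earlier.
    naive_errors = [abs(history[i] - history[i - season]) for i in range(season, len(history))]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("seasonal naive error is zero; MASE is undefined")
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale

# Two weeks of made-up daily counts; the second week is the first week plus one.
history = [1, 2, 3, 4, 5, 6, 7, 2, 3, 4, 5, 6, 7, 8]
score = mase(history, actual=[3, 4], forecast=[2, 6])
print(score)  # values below 1 beat the in-sample seasonal naive error, values above 1 do not
```

The zero-scale guard matters for sparse series, where a constant (often all-zero) history makes the denominator vanish.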

18 models · 4 subdatasets · 2 frequencies · 3 sparsity buckets · Scraped live from impermanent.timecopilot.dev. Refreshes every 12 hours. Last refreshed Jun 13, 2026, 5:03 PM

The temporal contamination problem

Static benchmarks freeze a test split at one point in time. Once that split is public, model authors can overfit to it — intentionally or not — and benchmark scores stop reflecting real-world performance. Impermanent sidesteps this by scoring forecasts before outcomes exist: the future data literally does not exist at evaluation time, so there is no split to leak.

How the prequential protocol works

Models submit forecasts on a rolling basis against a continuously updating stream of GitHub activity data (issues opened, PRs merged, pushes, stargazers) across 400 repositories and 4 frequencies. Scores accumulate over time, so a model that performs well early but degrades under distribution shift will see its aggregate ranking fall. This makes Impermanent a measure of sustained forecasting ability, not one-off performance.
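
To make the accumulation effect concrete, the sketch below tracks the running mean of per-cutoff ranks for two hypothetical models. The rank sequences are invented for illustration, and a plain running mean is an assumption about how scores accumulate rather than a documented part of the benchmark.

```python
from itertools import accumulate

def running_mean(values):
    """Mean of the first n values, for n = 1..len(values)."""
    return [total / n for n, total in enumerate(accumulate(values), start=1)]

# Hypothetical per-cutoff ranks (1 is best).
early_leader = [1, 1, 3, 6, 8]   # strong at first, then degrades under drift
steady = [3, 3, 3, 3, 3]         # never wins a cutoff, never collapses

leader_curve = running_mean(early_leader)
steady_curve = running_mean(steady)

for cutoff, (a, b) in enumerate(zip(leader_curve, steady_curve), start=1):
    print(f"cutoff {cutoff}: early_leader={a:.2f} steady={b:.2f}")
```

In this toy case the early leader is ahead for the first four cutoffs and falls behind at the fifth, which is the pattern a frozen split scored only at the first cutoff would miss.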

How to interpret it

  • Impermanent is the right benchmark when you care about temporal drift, not just a frozen snapshot.
  • A strong live benchmark should tell you whether early leaderboard wins persist under new data.
  • Because the public interface is still settling, treat the methodology as the durable asset and the leaderboard rows as an integration that is still in progress.

Frequently asked questions

What is the Impermanent benchmark?
Impermanent is a live benchmark for time series foundation models that tests temporal generalization. It uses a prequential protocol on continuously updating GitHub activity data, scoring models on forecasts they made before outcomes existed.
Why is Impermanent listed as an emerging benchmark?
The benchmark methodology and dataset are established, but the public machine-readable leaderboard feed is still stabilizing. TSFM.ai will publish automated rankings once the upstream source exposes a durable feed.
What is temporal contamination?
Temporal contamination occurs when benchmark test data exists at the time of model training, allowing models to overfit — intentionally or not — to the specific test split. Impermanent avoids this because future observations do not exist at evaluation time.
What data does Impermanent use?
Impermanent uses GitHub activity data from 400 repositories across 4 event types (issues, pull requests, pushes, stargazers) at 4 frequencies, producing 6,400 total time series.
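
The 6,400 figure is the product of the three dimensions: 400 repositories × 4 event types × 4 frequencies. The short sketch below enumerates that grid. The repository ids and frequency labels are placeholders, since the page does not list them; only the counts and the four event types come from the text above.

```python
from itertools import product

# Placeholder identifiers; the real repository names are not given on this page.
repos = [f"repo_{i:03d}" for i in range(400)]
event_types = ["issues", "pull_requests", "pushes", "stargazers"]
# Placeholder labels; the page states there are 4 frequencies but does not name them all.
frequencies = ["freq_1", "freq_2", "freq_3", "freq_4"]

# One time series per (repository, event type, frequency) combination.
series_keys = list(product(repos, event_types, frequencies))
print(len(series_keys))
```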

Related reading

Compare with other TSFM benchmarks

FEV Bench

How well does a model generalize to unseen real-world forecasting tasks?

GIFT-Eval

Which models stay strong across heterogeneous datasets and probabilistic settings?

BOOM

How do models behave on observability telemetry instead of academic datasets?

ARFBench

Which multimodal models can reason about anomalies, timing, magnitude, and cross-series structure in production telemetry?

Sources