Impermanent Benchmark
How well does a model generalize zero-shot to unseen forecasting series?
Impermanent is designed to test whether benchmark wins survive once real time passes. It scores forecasts sequentially on a continuously updating GitHub activity stream, so the benchmark reflects temporal drift and the fact that future observations do not exist at training time. This page scrapes the live Impermanent dashboard on every refresh so you can see how each model's rank moves across the prequential window, including per-subdataset, per-frequency, and per-sparsity drift.
Chronos leads with an average MASE rank of 3.94 across the prequential window.
What this benchmark answers
Does model performance hold up as real time passes and the data distribution shifts?
Methodology
The benchmark uses a prequential protocol: models forecast before outcomes exist, scores accumulate over time, and rankings reflect sustained performance under live temporal change rather than one-off wins on a frozen split.
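The protocol can be sketched as a loop over cutoffs: at each cutoff the model sees only the past, forecasts ahead, and is scored once the outcomes arrive. This is a minimal illustration, not the Impermanent submission API; the `forecast` and `score` callables are hypothetical stand-ins.

```python
def prequential_scores(series, cutoffs, horizon, forecast, score):
    """Score a model sequentially: at each cutoff the model sees only
    history up to that point, forecasts `horizon` steps ahead, and is
    scored against outcomes that did not exist at forecast time."""
    scores = []
    for t in cutoffs:
        history = series[:t]             # all data available at forecast time
        actuals = series[t:t + horizon]  # arrives only after the forecast
        preds = forecast(history, horizon)
        scores.append(score(actuals, preds))
    return scores  # accumulated scores; aggregating them gives the ranking


# Usage with a trivial last-value model and mean absolute error:
series = [3, 5, 4, 6, 7, 6, 8, 9]
naive = lambda hist, h: [hist[-1]] * h
mae = lambda y, yhat: sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)
print(prequential_scores(series, cutoffs=[4, 6], horizon=2,
                         forecast=naive, score=mae))  # → [0.5, 2.5]
```

Because scores are only recorded after outcomes materialize, there is no frozen test split for a model to overfit.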
Prequential leaderboard
Models ranked by their average MASE rank across 13 weekly cutoffs from Jan 18 to Apr 12.
Per-slice performance over time
Pick a subdataset, frequency, and sparsity bucket to see how each model scored at every prequential cutoff. Lower is better for both metrics.
| Model | Jan 18 | Jan 25 | Feb 1 | Feb 8 | Feb 15 | Feb 22 | Mar 1 | Mar 8 | Mar 15 | Mar 22 | Mar 29 | Apr 5 | Apr 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ZeroModel | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Chronos | 0.914 | 0.877 | 0.954 | 0.998 | 1.118 | 0.930 | 0.951 | 1.080 | 1.149 | 1.111 | 0.963 | 1.235 | 1.123 |
| TiRex | 0.990 | 0.897 | 0.990 | 1.011 | 1.167 | 0.902 | 1.041 | 1.069 | 1.221 | 1.104 | 0.935 | 1.155 | 1.141 |
| TimesFM | 0.959 | 0.919 | 1.012 | 1.059 | 1.210 | 0.937 | 1.049 | 1.200 | 1.248 | 1.269 | 0.985 | 1.195 | 1.152 |
| SeasonalNaive | 1.084 | 1.025 | 0.977 | 1.132 | 1.310 | 1.150 | 1.136 | 1.220 | 1.084 | 1.302 | 1.186 | 1.288 | 1.153 |
| Moirai | 1.009 | 0.875 | 0.968 | 1.079 | 1.224 | 0.988 | 1.026 | 1.111 | 1.246 | 1.224 | 0.987 | 1.189 | 1.243 |
| DynamicOptimizedTheta | 1.077 | 1.081 | 1.305 | 1.174 | 1.320 | 1.091 | 1.156 | 1.400 | 1.600 | 1.733 | 1.346 | 2.071 | 1.574 |
| AutoCES | 1.105 | 1.082 | 1.188 | 1.173 | 1.316 | 1.117 | 1.226 | 1.367 | 1.634 | 1.657 | 1.427 | 2.105 | 1.634 |
| AutoETS | 1.113 | 1.107 | 1.273 | 1.269 | 1.352 | 1.106 | 1.207 | 1.395 | 1.608 | 1.876 | 1.522 | 2.187 | 1.742 |
| AutoARIMA | 1.106 | 1.137 | 1.286 | 1.292 | 1.361 | 1.126 | 1.224 | 1.384 | 1.653 | 1.756 | 1.574 | 2.099 | 1.759 |
| Prophet | 1.259 | 1.199 | 1.358 | 1.356 | 1.586 | 1.386 | 1.368 | 1.626 | 1.958 | 2.207 | 1.970 | 2.761 | 2.387 |
| HistoricAverage | 2.354 | 2.165 | 2.677 | 2.514 | 3.224 | 2.455 | 2.659 | 3.555 | 4.356 | 5.332 | 4.527 | 7.727 | 6.041 |
13 prequential cutoffs for Issues opened · Daily · Low sparsity. Values are scaled MASE against the seasonal naive baseline; lower is better.
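For reference, the standard MASE (Hyndman & Koehler) divides forecast error by the in-sample error of a seasonal naive forecast; a value below 1.0 means the model beat that baseline on average. This is a generic sketch of that definition, not Impermanent's exact scaling code.

```python
def mase(actuals, preds, train, m=1):
    """Mean Absolute Scaled Error: out-of-sample forecast MAE divided by
    the in-sample MAE of the seasonal naive forecast with period m."""
    mae = sum(abs(a - p) for a, p in zip(actuals, preds)) / len(actuals)
    naive_errors = [abs(train[i] - train[i - m]) for i in range(m, len(train))]
    scale = sum(naive_errors) / len(naive_errors)
    return mae / scale


# Example: forecast MAE is 0.5, in-sample naive MAE is 5/3, so MASE = 0.3.
print(mase(actuals=[14, 13], preds=[13, 13], train=[10, 12, 11, 13], m=1))
```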
The temporal contamination problem
Static benchmarks freeze a test split at one point in time. Once that split is public, model authors can overfit to it, intentionally or not, and benchmark scores stop reflecting real-world performance. Impermanent sidesteps this by scoring forecasts before outcomes exist: the future data literally does not exist at evaluation time, so there is no split to leak.
How the prequential protocol works
Models submit forecasts on a rolling basis against a continuously updating stream of GitHub activity data (issues opened, PRs merged, pushes, stargazers) across 400 repositories and 4 frequencies. Scores accumulate over time, so a model that performs well early but degrades under distribution shift will see its aggregate ranking fall. This makes Impermanent a measure of sustained forecasting ability, not one-off performance.
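The "degrade and fall" mechanism comes from averaging per-cutoff ranks. A minimal sketch, assuming lower scores are better and ignoring tie handling for clarity (this is illustrative, not Impermanent's actual aggregation code):

```python
def average_ranks(scores_by_model):
    """Average each model's rank (1 = best) across prequential cutoffs.
    scores_by_model maps model name -> list of scores, one per cutoff."""
    models = list(scores_by_model)
    n_cutoffs = len(next(iter(scores_by_model.values())))
    totals = {m: 0 for m in models}
    for t in range(n_cutoffs):
        # Rank models at this cutoff by their score (lower is better).
        ordered = sorted(models, key=lambda m: scores_by_model[m][t])
        for rank, m in enumerate(ordered, start=1):
            totals[m] += rank
    return {m: totals[m] / n_cutoffs for m in models}


# A model that wins early but degrades loses the aggregate ranking:
scores = {
    "early_winner": [0.8, 0.9, 1.5, 1.6, 1.7],
    "steady":       [1.0, 1.0, 1.0, 1.0, 1.0],
}
print(average_ranks(scores))  # → {'early_winner': 1.6, 'steady': 1.4}
```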
How to interpret it
- Impermanent is the right benchmark when you care about temporal drift, not just a frozen snapshot.
- A strong live benchmark should tell you whether early leaderboard wins persist under new data.
- Because the public interface is still settling, treat the methodology as the durable asset and the rows as an upcoming integration.
Frequently asked questions
- What is the Impermanent benchmark?
- Impermanent is a live benchmark for time series foundation models that tests temporal generalization. It uses a prequential protocol on continuously updating GitHub activity data, scoring models on forecasts they made before outcomes existed.
- Why is Impermanent listed as an emerging benchmark?
- The benchmark methodology and dataset are established, but the public machine-readable leaderboard feed is still stabilizing. TSFM.ai will publish automated rankings once the upstream source exposes a durable feed.
- What is temporal contamination?
- Temporal contamination occurs when benchmark test data exists at the time of model training, allowing models to overfit, intentionally or not, to the specific test split. Impermanent avoids this because future observations do not exist at evaluation time.
- What data does Impermanent use?
- Impermanent uses GitHub activity data from 400 repositories across 4 event types (issues, pull requests, pushes, stargazers) at 4 frequencies, producing 6,400 total time series.