How well does a model generalize to unseen real-world forecasting tasks?
Impermanent Benchmark
Impermanent is designed to test whether benchmark wins survive once real time passes. It scores forecasts sequentially on a continuously updating GitHub activity stream, so the benchmark reflects temporal drift and the fact that future observations do not exist at training time. This page scrapes the live Impermanent dashboard on every refresh so you can see how each model’s rank moves across the prequential window — including per-subdataset, per-frequency, and per-sparsity drift.
ZeroModel leads with an average MASE rank of 6.56 across the prequential window.
What this benchmark answers
Does model performance hold up as real time passes and the data distribution shifts?
Methodology
The benchmark uses a prequential protocol: models forecast before outcomes exist, scores accumulate over time, and rankings reflect sustained performance under live temporal change rather than one-off wins on a frozen split.
Prequential leaderboard
Models ranked by their sustained MASE rank across 13 weekly cutoffs from Mar 1 to May 24.
| # | Model | MASE rank | CRPS rank |
|---|---|---|---|
| 🥇 | ZeroModel | 6.558 | 2.413 |
| 🥈 | | 5.675 | 4.650 |
| 🥉 | | 5.325 | 5.429 |
| 4 | AutoNBEATS | 5.572 | 5.294 |
| 5 | AutoNHITS | 6.000 | 5.139 |
| 6 | AutoTFT | 6.006 | 5.806 |
| 7 | AutoPatchTST | 7.361 | 4.533 |
| 8 | | 6.067 | 6.467 |
| 9 | | 5.963 | 6.908 |
| 10 | | 9.378 | 9.856 |
| 11 | AutoCES | 9.321 | 11.48 |
| 12 | DynamicOptimizedTheta | 9.346 | 12.36 |
| 13 | SeasonalNaive | 5.792 | 16.03 |
| 14 | Toto | 16.27 | 6.278 |
| 15 | AutoETS | 12.07 | 12.54 |
| 16 | AutoARIMA | 12.80 | 12.66 |
| 17 | Prophet | 14.72 | 13.44 |
| 18 | HistoricAverage | 16.18 | 15.70 |
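One plausible reading of the rank columns: at each cutoff every model is ranked by its score (1 is best), and those ranks are averaged over the cutoffs. The sketch below assumes that aggregation. The exact rule Impermanent applies (tie handling, missing cutoffs, averaging across slices) is not stated on this page, and the function name and toy scores are hypothetical.

```python
from statistics import mean


def average_ranks(scores_by_cutoff):
    """Average each model's per-cutoff rank (1 = lowest score = best).

    scores_by_cutoff: list of dicts mapping model name -> score.
    A model missing from a cutoff is skipped for that cutoff
    (an assumption; the benchmark may handle gaps differently).
    """
    ranks = {}
    for scores in scores_by_cutoff:
        # Sort models by score; ties are broken by name for determinism.
        ordered = sorted(scores, key=lambda m: (scores[m], m))
        for position, model in enumerate(ordered, start=1):
            ranks.setdefault(model, []).append(position)
    return {model: mean(r) for model, r in ranks.items()}


# Toy scores for three hypothetical models over three cutoffs.
toy = [
    {"A": 0.9, "B": 1.1, "C": 1.5},
    {"A": 1.2, "B": 1.0, "C": 1.4},
    {"A": 0.8, "B": 1.3},  # model C has no score at this cutoff
]
print(average_ranks(toy))
```

Averaging ranks rather than raw scores keeps a single extreme cutoff from dominating the aggregate, which may be why a rank-based summary is shown here.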
Per-slice performance over time
Each slice is defined by a subdataset, a frequency, and a sparsity bucket; the table shows how every model scored on one slice at each prequential cutoff. Lower is better for both metrics.
| Model | Mar 1 | Mar 8 | Mar 15 | Mar 22 | Mar 29 | Apr 5 | Apr 12 | Apr 19 | Apr 26 | May 3 | May 10 | May 17 | May 24 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ZeroModel | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.880 | 0.259 |
| TiRex | 1.041 | 1.069 | 1.221 | 1.104 | 0.935 | 1.155 | 1.141 | 1.248 | 1.152 | 1.286 | 1.304 | 1.005 | 0.798 |
| Moirai | 1.026 | 1.111 | 1.246 | 1.224 | 0.987 | 1.189 | 1.243 | 1.203 | 1.066 | 1.304 | 1.453 | 1.116 | 0.874 |
| TimesFM | 1.049 | 1.200 | 1.248 | 1.269 | 0.985 | 1.195 | 1.152 | 1.294 | 1.204 | 1.333 | 1.417 | 1.114 | 0.891 |
| Chronos | 0.951 | 1.080 | 1.149 | 1.111 | 0.963 | 1.235 | 1.123 | 1.254 | 1.236 | 1.313 | 1.350 | 1.076 | 0.970 |
| SeasonalNaive | 1.136 | 1.220 | 1.084 | 1.302 | 1.186 | 1.288 | 1.153 | 1.276 | 1.521 | 1.363 | 1.698 | 1.170 | 1.283 |
| DynamicOptimizedTheta | 1.156 | 1.400 | 1.600 | 1.733 | 1.346 | 2.071 | 1.574 | 1.783 | 1.844 | 2.161 | 2.424 | 2.078 | 1.765 |
| AutoETS | 1.207 | 1.395 | 1.608 | 1.876 | 1.522 | 2.187 | 1.742 | 1.962 | 1.933 | 2.307 | 2.385 | 2.276 | 1.940 |
| AutoCES | 1.226 | 1.367 | 1.634 | 1.657 | 1.427 | 2.105 | 1.634 | 1.868 | 1.961 | 2.183 | 2.404 | 2.247 | 1.957 |
| AutoARIMA | 1.224 | 1.384 | 1.653 | 1.756 | 1.574 | 2.099 | 1.759 | 1.912 | 2.011 | 2.334 | 2.408 | 2.321 | 2.042 |
| Prophet | 1.368 | 1.626 | 1.958 | 2.207 | 1.970 | 2.761 | 2.387 | 2.888 | 2.962 | 3.616 | 4.035 | 3.999 | 3.460 |
| HistoricAverage | 2.659 | 3.555 | 4.356 | 5.332 | 4.527 | 7.727 | 6.041 | 7.513 | 7.851 | 10.25 | 11.08 | 12.77 | 10.96 |
| AutoNBEATS | 0.964 | 1.080 | 1.097 | 1.057 | 0.954 | 1.272 | 1.137 | 1.171 | — | — | — | — | — |
| AutoNHITS | 0.985 | 0.998 | 1.037 | 1.051 | 0.985 | 1.298 | 1.110 | 1.110 | — | — | — | — | — |
| AutoPatchTST | 1.232 | 1.106 | 1.180 | 1.317 | 1.124 | 1.430 | 1.208 | 1.063 | — | — | — | — | — |
| AutoTFT | 0.923 | 0.998 | 1.057 | 0.999 | 0.931 | 1.030 | 1.104 | 1.027 | — | — | — | — | — |
| FlowState | 1.137 | 1.270 | 1.349 | 1.441 | 1.199 | 1.677 | 1.363 | 1.571 | — | — | — | — | — |
| Toto | 2.135 | 2.633 | 3.235 | 3.397 | 2.898 | 5.143 | 3.532 | 3.763 | — | — | — | — | — |
13 prequential cutoffs for Issues opened · Daily · Low sparsity. Values are scaled MASE against the seasonal naive baseline; lower is better.
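For reference, a minimal sketch of the standard MASE definition: the forecast's mean absolute error divided by the in-sample mean absolute error of a seasonal naive forecast. The season length, the zero-denominator handling, and the toy numbers are assumptions for illustration; Impermanent's own implementation may differ.

```python
def mase(y_train, y_true, y_pred, season=7):
    """Mean absolute scaled error.

    Forecast MAE divided by the in-sample MAE of a seasonal naive
    forecast (each training value predicted by the value `season`
    steps earlier). season=7 is an assumed weekly cycle for daily data.
    """
    if len(y_train) <= season:
        raise ValueError("training series must be longer than the season")
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    naive_errors = [abs(y_train[t] - y_train[t - season])
                    for t in range(season, len(y_train))]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("seasonal naive error is zero; MASE is undefined")
    forecast_mae = sum(abs(a - f) for a, f in zip(y_true, y_pred)) / len(y_true)
    return forecast_mae / scale


history = [3, 5, 4, 6, 5, 7, 6, 8]  # toy daily counts
actual = [7, 9]
forecast = [8, 8]
print(mase(history, actual, forecast, season=1))
```

A value below 1 means the forecast's error is smaller than the seasonal naive method's average in-sample error; a value above 1 means it is larger.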
The temporal contamination problem
Static benchmarks freeze a test split at one point in time. Once that split is public, model authors can overfit to it — intentionally or not — and benchmark scores stop reflecting real-world performance. Impermanent sidesteps this by scoring forecasts before outcomes exist: the future data literally does not exist at evaluation time, so there is no split to leak.
How the prequential protocol works
Models submit forecasts on a rolling basis against a continuously updating stream of GitHub activity data (issues opened, PRs merged, pushes, stargazers) across 400 repositories and 4 frequencies. Scores accumulate over time, so a model that performs well early but degrades under distribution shift will see its aggregate ranking fall. This makes Impermanent a measure of sustained forecasting ability, not one-off performance.
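The rolling evaluation described above can be sketched as a loop over cutoffs in which the forecaster only ever sees observations from before the cutoff. This is a simulation on an already-recorded series, not the live pipeline: in the benchmark the outcomes genuinely do not exist yet when forecasts are made. The function names, the last-value baseline, and the use of plain mean absolute error in place of MASE or CRPS are assumptions for illustration.

```python
def prequential_scores(series, cutoffs, horizon, forecaster):
    """Score a forecaster at each cutoff using only data before it.

    series: list of observations in time order.
    cutoffs: indices into `series`; the forecast at cutoff c sees series[:c].
    horizon: number of steps forecast at each cutoff.
    forecaster: callable (history, horizon) -> list of `horizon` forecasts.
    Returns a list of (cutoff, mean absolute error) pairs.
    """
    results = []
    for c in cutoffs:
        history = series[:c]             # everything known at forecast time
        forecast = forecaster(history, horizon)
        actual = series[c:c + horizon]   # outcomes that arrive afterwards
        if len(actual) < horizon:
            break                        # outcomes not fully observed yet
        mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / horizon
        results.append((c, mae))
    return results


def last_value(history, horizon):
    """Naive baseline: repeat the most recent observation."""
    return [history[-1]] * horizon


stream = [2, 4, 4, 6, 5, 7, 9, 8, 10, 12]
print(prequential_scores(stream, cutoffs=[4, 6, 8], horizon=2,
                         forecaster=last_value))
```

Because each score depends only on data available before its cutoff, the list of per-cutoff scores can grow as new observations arrive without revisiting earlier entries.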
How to interpret it
- Impermanent is the right benchmark when you care about temporal drift, not just a frozen snapshot.
- A strong live benchmark should tell you whether early leaderboard wins persist under new data.
- Because the public interface is still settling, treat the methodology as the durable asset and the rows as an upcoming integration.
Frequently asked questions
- What is the Impermanent benchmark?
  - Impermanent is a live benchmark for time series foundation models that tests temporal generalization. It uses a prequential protocol on continuously updating GitHub activity data, scoring models on forecasts they made before outcomes existed.
- Why is Impermanent listed as an emerging benchmark?
  - The benchmark methodology and dataset are established, but the public machine-readable leaderboard feed is still stabilizing. TSFM.ai will publish automated rankings once the upstream source exposes a durable feed.
- What is temporal contamination?
  - Temporal contamination occurs when benchmark test data exists at the time of model training, allowing models to overfit, intentionally or not, to the specific test split. Impermanent avoids this because future observations do not exist at evaluation time.
- What data does Impermanent use?
  - Impermanent uses GitHub activity data from 400 repositories across 4 event types (issues, pull requests, pushes, stargazers) at 4 frequencies, producing 6,400 total time series.