Impermanent: The First Live Benchmark for Temporal Generalization in TSFMs
Static benchmarks contaminate themselves over time. Impermanent is the first live benchmark that scores TSFM forecasts sequentially on a continuously updated data stream, testing whether foundation-model generalization actually holds once real time passes.
Every major time series foundation model benchmarked today, including Chronos, TimesFM, Moirai, and TiRex, has been evaluated on datasets that were frozen months or years before the evaluation was run. That sounds benign, but it creates a subtle and compounding problem: the longer a benchmark exists, the more likely it is that a model's pretraining data overlaps with its test set, either directly through shared datasets or indirectly through internet crawls that capture publicly posted benchmark data. The benchmark that was clean at launch drifts toward contamination with every passing month.
That is one of the key motivations behind Impermanent (arXiv:2603.08707), a new benchmark introduced by researchers from TimeCopilot, ELLIS Institute Tübingen, Mila, and Amazon Web Services. Published on March 9, 2026, it is, to the authors' knowledge, the first live benchmark specifically designed to measure temporal generalization in time-series forecasting: the ability of a model to remain accurate as real time passes and distributions evolve. We now surface it on a dedicated Impermanent benchmark page inside the broader TSFM.ai benchmark hub, alongside FEV Bench, GIFT-Eval, and BOOM.
The Problem Impermanent Solves
Our existing post on TSFM benchmarking challenges covered several dimensions of the evaluation problem: dataset contamination, inconsistent protocols, cherry-picked results, and the gap between academic datasets and production data. Impermanent targets a distinct fifth dimension that static benchmarks structurally cannot address: temporal contamination and non-stationarity over time.
Static benchmarks like GIFT-Eval and FEV Bench attempt to solve data contamination by carefully selecting datasets that are unlikely to appear in any model's pretraining corpus. This is valuable and necessary, but it only protects against contamination at the moment the benchmark is published. Six months later, model developers know which datasets are in the benchmark, and future training runs can include them. A year later, the benchmark test data is effectively part of the public record. The contamination problem recurs, and no amount of careful dataset curation prevents it indefinitely.
There is a second, more fundamental issue: a single frozen evaluation cannot tell you whether a model's performance persists over time. Real-world forecasting is not a snapshot. It is an ongoing process where distributions shift, new patterns emerge, and models must remain accurate across structural breaks, regime changes, and external shocks. A model that excels on a 2023 snapshot of retail sales may fail on 2025 sales data shaped by different consumer behavior, supply chain dynamics, or macroeconomic conditions. Static benchmarks provide no evidence about this.
Impermanent addresses both problems simultaneously by moving to a prequential evaluation loop: forecasts are issued before their corresponding ground truth exists, scored only after outcomes arrive, and accumulated sequentially as real time passes. This eliminates temporal contamination by construction, because you cannot train on data that has not happened yet, and directly measures sustained performance under ongoing distributional change.
The Dataset: GitHub Software Activity
For its first instantiation, Impermanent uses GitHub open-source repository activity as the live data source. This is a clever and principled choice. GitHub event streams, archived continuously by GH Archive, are:
- Truly live. New data arrives every day.
- Highly non-stationary. Repository activity is shaped by releases, contributor turnover, external events, platform changes, and community trends.
- Diverse in dynamics. Some repositories show smooth growth; others spike sharply around major releases and then decay.
- Uncontaminated by design. No existing TSFM pretraining corpus includes future GitHub activity.
The benchmark tracks four event types for the top 400 repositories by star count:
- Issues opened
- Pull requests opened
- Push events
- New stargazers
For each repository-event pair, a univariate time series is constructed and evaluated at four forecast frequencies: hourly (24-step horizon), daily (7-step), weekly (4-step), and monthly (1-step). Across 400 repositories, 4 event types, and 4 frequencies, this yields 6,400 individual series evaluated continuously.
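The task grid is simple to enumerate and sanity-check. A minimal sketch (repository names here are placeholders; the actual benchmark draws the top 400 repositories by star count):

```python
from itertools import product

EVENT_TYPES = ["issues_opened", "prs_opened", "push_events", "new_stargazers"]
# (frequency, forecast horizon) pairs from the benchmark description
FREQUENCIES = {"hourly": 24, "daily": 7, "weekly": 4, "monthly": 1}

# Placeholder repository list standing in for the top 400 by stars.
repos = [f"repo_{i}" for i in range(400)]

tasks = [
    {"repo": r, "event": e, "freq": f, "horizon": h}
    for r, e, (f, h) in product(repos, EVENT_TYPES, FREQUENCIES.items())
]
# 400 repos x 4 event types x 4 frequencies = 6,400 series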
The Evaluation Protocol
The evaluation follows a strict prequential protocol. At each cutoff date:
- Each model receives a context window of historical observations.
- The model must produce point forecasts and probabilistic forecasts for the next h time steps.
- No ground truth is available at forecast time. The model must commit to its predictions.
- When outcomes arrive, forecasts are scored.
- Cutoffs advance by exactly one horizon, ensuring no overlap.
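The steps above amount to a rolling-origin walk through each series. A minimal sketch of that loop (function names are illustrative, not the TimeCopilot harness itself):

```python
import numpy as np

def prequential_eval(series, context_len, horizon, forecast_fn, score_fn):
    """Walk forward through a series: forecast before the truth exists,
    score once it arrives, advance by exactly one horizon (no overlap)."""
    scores = []
    cutoff = context_len
    while cutoff + horizon <= len(series):
        context = series[cutoff - context_len:cutoff]   # history only
        y_pred = forecast_fn(context, horizon)          # model commits here
        y_true = series[cutoff:cutoff + horizon]        # "arrives" later
        scores.append(score_fn(y_true, y_pred))
        cutoff += horizon                               # non-overlapping cutoffs
    return scores
```

With a last-value naive forecaster and MAE as the score, every cutoff produces one score, and the sequence of scores over cutoffs is exactly the "sustained performance over time" signal the benchmark accumulates.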
Two metrics are used:
- MASE (Mean Absolute Scaled Error): normalizes point forecast error against a seasonal naive baseline. A MASE of 1.0 equals the naive baseline; below 1.0 beats it.
- Scaled CRPS (Continuous Ranked Probability Score): measures probabilistic forecast quality across nine quantile levels (0.1 through 0.9).
To make scores comparable across repositories with wildly different activity levels, both metrics are scaled by the "zero model" (a model that always predicts zero), using the 10th percentile of zero-model scores as a floor to prevent unstable ratios on near-zero denominators. The result is a score where lower is better and 1.0 represents zero-model performance.
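Concretely, the two metrics and the zero-model scaling can be sketched as follows. This is a simplification of the paper's exact definitions; the quantile-averaged pinball loss is a standard discrete approximation of CRPS, not necessarily the benchmark's precise implementation:

```python
import numpy as np

def mase(y_true, y_pred, y_train, season):
    # Point error scaled by the in-sample seasonal-naive error.
    naive_err = np.mean(np.abs(y_train[season:] - y_train[:-season]))
    return np.mean(np.abs(y_true - y_pred)) / naive_err

def quantile_crps(y_true, q_preds, qs=np.linspace(0.1, 0.9, 9)):
    # CRPS approximated as twice the mean pinball loss over nine quantiles.
    losses = [np.mean(np.maximum(q * (y_true - p), (q - 1) * (y_true - p)))
              for q, p in zip(qs, q_preds)]
    return 2.0 * float(np.mean(losses))

def zero_scaled(model_score, zero_score, all_zero_scores, floor_pct=10):
    # Divide by the zero-model score, floored at the 10th percentile of
    # zero-model scores so near-zero denominators cannot explode the ratio.
    floor = np.percentile(all_zero_scores, floor_pct)
    return model_score / max(zero_score, floor)
```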
Models Evaluated
Impermanent evaluates twelve models in three groups:
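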
Naive baselines: ZeroModel, HistoricAverage, SeasonalNaive
Statistical models: AutoARIMA, AutoETS, AutoCES (Complex Exponential Smoothing), Dynamic Optimized Theta, Prophet
Foundation models: Chronos-2, Moirai 2.0-R-Small, TimesFM 2.5, TiRex
Foundation models run on A10G GPUs with batch size 64. Statistical models run on CPU. Each model outputs nine quantile forecasts for direct probabilistic comparison. Only open-source TSFMs with publicly released weights and reproducible inference code are included.
Early Results
In the paper's initial snapshot covering data through February 12, 2026, the results paint an encouraging picture for TSFM practitioners, with important nuance.
Foundation models occupy the top four positions on the overall leaderboard, ahead of all statistical methods. TimesFM 2.5 leads on three of four metric columns in the early data. This is a meaningful signal: even on highly non-stationary, live data from a domain entirely outside typical TSFM pretraining corpora, foundation models outperform statistical alternatives.
But calibration matters more than point accuracy alone. SeasonalNaive achieves a respectable MASE rank of 5.39 while posting a CRPS rank of 9.50, near the bottom. The divergence shows that SeasonalNaive can stay competitive on point accuracy for strongly seasonal series while producing badly miscalibrated prediction intervals. Foundation models, by contrast, hold up on both metrics simultaneously, suggesting their probabilistic outputs are more reliable across non-stationary regimes.
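The mechanism behind that divergence is easy to demonstrate. A forecaster whose nine quantiles all collapse onto its point forecast pays the pinball penalty at every quantile level whenever the outcome lands away from that point, while a forecaster with a sensible spread around the same point does not. A toy illustration (all numbers invented for the demonstration):

```python
import numpy as np

qs = np.linspace(0.1, 0.9, 9)

def pinball(y, pred, q):
    # Quantile (pinball) loss for a single quantile forecast.
    d = y - pred
    return np.maximum(q * d, (q - 1) * d)

y = 12.0                          # realized value
point = 10.0                      # both forecasters predict 10 as the median
flat = np.full(9, point)          # all nine quantiles collapsed to the point
spread = point + 4 * (qs - 0.5)   # the same median, with a spread around it

crps_flat = 2 * np.mean([pinball(y, p, q) for p, q in zip(flat, qs)])
crps_spread = 2 * np.mean([pinball(y, p, q) for p, q in zip(spread, qs)])
# Same median error, but the collapsed intervals score far worse on CRPS.
```

Both forecasters have identical point (median) error here; only the interval shape differs, and only CRPS sees the difference.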
Rankings will shift. The authors are explicit that the February 2026 snapshot is an early data point, not a definitive verdict. As more cutoffs accumulate, models that maintain their advantage under ongoing distributional shift will distinguish themselves from models that happened to be well-suited to the specific conditions of the first evaluation window.
What Impermanent Changes for Practitioners
Evaluating your own model deployments. The prequential protocol that Impermanent applies to a benchmark is the same protocol that sound production evaluation uses. If you are backtesting a TSFM deployment, rolling-origin evaluation is the correct approach. Impermanent formalizes and demonstrates this approach at scale, providing a template for rigorous production validation.
Interpreting static benchmark rankings with appropriate skepticism. FEV Bench and GIFT-Eval tell you which models perform well on contamination-aware static datasets. BOOM tells you which models perform well on observability telemetry at a point in time. None of them tell you whether those rankings hold six months later on genuinely new data. Impermanent exists to answer that next question.
The right question to ask. Static benchmarks answer: "Which model performs best on data from this distribution, frozen at this moment?" Impermanent answers: "Which model stays best as time passes and the distribution changes?" For production deployments, the second question is the one that matters.
Concept drift is real and measurable. The GitHub dataset is explicitly non-stationary. Repository activity shifts as ecosystems change, which makes Impermanent a concrete way to track how model rankings evolve under real temporal drift.
How to Engage with Impermanent
The benchmark definition and public project surface are available at github.com/TimeCopilot/impermanent and impermanent.timecopilot.dev. On TSFM.ai, the benchmark already has a live home at /benchmarks/impermanent. That page now covers the benchmark's methodology, source links, related reading, and its place in the wider benchmark surface. The remaining piece is automated leaderboard ingestion: the public feed is still early, so for now the page is optimized for benchmark discovery and interpretation rather than a fully populated row-level leaderboard.
Adding a new model requires implementing an inference interface compatible with the TimeCopilot evaluation harness. The authors note that future extensions will expand beyond GitHub activity to additional live data streams and will incorporate auxiliary contextual information to test covariate handling in live settings.
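The precise interface is defined by the TimeCopilot harness. As a hedged illustration of the general shape such an adapter tends to take (class, method, and field names below are hypothetical, not the actual API), a new model wraps its own inference behind a common predict contract that returns both a point forecast and the nine quantiles:

```python
from dataclasses import dataclass
import numpy as np

QUANTILES = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

@dataclass
class Forecast:
    mean: np.ndarray        # point forecast, shape (horizon,)
    quantiles: np.ndarray   # shape (len(QUANTILES), horizon)

class SeasonalNaiveAdapter:
    """Hypothetical adapter: any model plugs in by implementing the same
    predict(context, horizon) contract the harness calls at each cutoff."""
    def __init__(self, season: int = 24):
        self.season = season

    def predict(self, context: np.ndarray, horizon: int) -> Forecast:
        # Repeat the last full season forward as the point forecast.
        season = min(self.season, len(context))
        tile = np.tile(context[-season:], horizon // season + 1)[:horizon]
        # Degenerate intervals: all nine quantiles equal the point forecast
        # (exactly the miscalibration the CRPS results penalize).
        q = np.repeat(tile[None, :], len(QUANTILES), axis=0)
        return Forecast(mean=tile, quantiles=q)
```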
How This Connects to the TSFM.ai Benchmark Suite
We already surface FEV Bench, GIFT-Eval, and BOOM inside the TSFM.ai benchmark hub to help practitioners make informed model selections. With this update, Impermanent is part of that same product surface today, with its own route, sitemap coverage, and benchmark-specific internal linking. Each benchmark fills a distinct role: GIFT-Eval for comprehensive static coverage across domains, FEV Bench for pure zero-shot generalization on contamination-aware datasets, BOOM for production observability telemetry, and Impermanent for temporal robustness under live drift.
Impermanent adds a fourth dimension that the others cannot provide: evidence of sustained performance over time. As the benchmark matures and more cutoffs accumulate, it will become one of the most useful public signals for understanding which TSFMs are genuinely robust, not just well-calibrated to the current snapshot of benchmark datasets. The benchmark page already exists; what comes next is deeper row-level automation as the upstream feed becomes easier to consume reliably.
For teams evaluating which foundation model to deploy today, the short answer from early Impermanent results is consistent with what we see on static benchmarks: current-generation TSFMs outperform statistical methods even on highly non-stationary live data. But the more important takeaway is methodological. If you are not evaluating your deployed forecasting system with a rolling-origin, prequential protocol on genuinely new data, you are measuring past performance rather than present capability.