Tags: conformal-prediction, uncertainty, calibration, probabilistic, practical-guide

Conformal Prediction: Calibrated Uncertainty Intervals for Time Series Foundation Models

Model-native prediction intervals are often miscalibrated. Conformal prediction provides a distribution-free wrapper that turns any forecaster's output into intervals with guaranteed coverage.

TSFM.ai Team
February 5, 2025 · 5 min read

When a time series foundation model outputs a 90% prediction interval, you might reasonably expect the true value to land inside that interval 90% of the time. In practice, it often does not. Empirical studies routinely show that model-native intervals, whether produced by Chronos's quantile sampling, Moirai's distribution heads, or Lag-Llama's probabilistic outputs, can exhibit substantial miscalibration. A nominal 90% interval might only achieve 75% coverage on a new domain, or it might be so wide that it covers 99% of outcomes while providing little decision value. This is especially acute in zero-shot settings where the model encounters data distributions it was never explicitly trained on.

Conformal prediction offers a principled fix. It is a distribution-free statistical framework that wraps any forecaster, whether it produces point predictions or full quantile forecasts, and produces intervals with provable finite-sample coverage guarantees. No assumptions about the data distribution are required beyond a mild exchangeability condition.

How Conformal Prediction Works

The core idea is remarkably simple. You start with a forecaster and a held-out calibration set that the model has not seen during training or tuning. On this calibration set, you compute nonconformity scores, which measure how much each true observation deviates from the model's prediction. For a point forecaster, the nonconformity score is typically the absolute residual. For a quantile forecaster, it might be the distance by which the true value falls outside the predicted interval.

Once you have these scores, you sort them and find the empirical quantile corresponding to your desired coverage level. If you want 90% coverage and have 100 calibration points, you take the 91st smallest nonconformity score, the ⌈(n+1)(1−α)⌉-th order statistic, as your critical value; this is roughly the 90th percentile, with a small finite-sample correction. You then construct the prediction interval for a new observation by adding and subtracting this critical value from the point forecast. A foundational result in conformal prediction theory guarantees that this interval will contain the next observation with at least the desired probability, provided the calibration and test data are exchangeable.
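The whole procedure fits in a few lines. Here is a minimal sketch for a point forecaster, assuming NumPy arrays of calibration actuals and forecasts; the function name is ours, not from any particular library:

```python
import numpy as np

def split_conformal_interval(y_cal, yhat_cal, yhat_new, alpha=0.1):
    """Wrap point forecasts in a split conformal interval.

    y_cal, yhat_cal: actuals and forecasts on the held-out calibration set.
    yhat_new: point forecast(s) to wrap.
    alpha: target miscoverage (0.1 gives a 90% interval).
    """
    scores = np.abs(y_cal - yhat_cal)              # nonconformity = |residual|
    n = len(scores)
    # Finite-sample rank: the ceil((n+1)(1-alpha))-th smallest score.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    qhat = np.sort(scores)[min(k, n) - 1]          # critical value
    return yhat_new - qhat, yhat_new + qhat
```

With 100 calibration points and alpha=0.1, `k` is 91, matching the order statistic described above.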

This is called split conformal prediction, and its simplicity is its greatest strength. It works with any black-box model. There is no need to retrain, modify architectures, or make distributional assumptions. You just need a calibration set and a few lines of postprocessing code.

The Challenge of Non-Stationary Time Series

Standard split conformal prediction assumes exchangeability: the joint distribution of the calibration and test data is unchanged if you reorder them. Time series data violates this assumption. Observations are ordered, autocorrelated, and often non-stationary. A calibration set from January may not represent the error distribution you will see in July.

Adaptive Conformal Inference (ACI), introduced by Gibbs and Candès, addresses this by dynamically adjusting the working coverage level over time. When the model starts making more errors than expected, ACI widens the intervals. When it is performing well, intervals tighten. The adjustment follows a simple online update rule with a single learning rate parameter, making it easy to implement and tune.
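The ACI update itself is essentially one line. This sketch follows the Gibbs and Candès rule, with `gamma` as the single learning rate; the function name is illustrative:

```python
def aci_update(alpha_t, alpha_target, covered, gamma=0.01):
    """One step of Adaptive Conformal Inference.

    alpha_t: current working miscoverage level (used to pick the
             score quantile for the next interval).
    alpha_target: nominal miscoverage, e.g. 0.1 for 90% coverage.
    covered: whether the latest observation fell inside its interval.
    """
    err = 0.0 if covered else 1.0
    # A miss (err=1) lowers alpha_t, which widens the next interval;
    # a hit raises it slightly, tightening intervals over time.
    return alpha_t + gamma * (alpha_target - err)
```

At each step you form the interval using the `1 - alpha_t` quantile of the recent nonconformity scores, then feed the coverage outcome back through this rule.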

Ensemble Batch Prediction Intervals (EnbPI), proposed by Xu and Xie, takes a different approach designed specifically for time series. EnbPI trains an ensemble of bootstrap models and computes prediction intervals by aggregating leave-one-out residuals. Crucially, it updates residuals in a rolling fashion as new data arrives, allowing the intervals to adapt to distribution shifts without requiring the exchangeability assumption. This makes it particularly well-suited for production forecasting systems where data characteristics evolve continuously.
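A full EnbPI implementation maintains a bootstrap ensemble with leave-one-out aggregation; the sketch below keeps only the rolling-residual update that gives the method its adaptivity, using a single model for brevity. The class name and parameters are our own, not from the EnbPI paper or any library:

```python
from collections import deque
import numpy as np

class RollingResidualIntervals:
    """Stripped-down sketch of EnbPI's rolling residual update
    (single model; the bootstrap ensemble is omitted)."""

    def __init__(self, residuals, alpha=0.1, window=200):
        # Fixed-size buffer: old residuals fall out as new ones arrive.
        self.buffer = deque(residuals, maxlen=window)
        self.alpha = alpha

    def interval(self, yhat):
        scores = np.abs(np.array(self.buffer))
        qhat = np.quantile(scores, 1 - self.alpha, method="higher")
        return yhat - qhat, yhat + qhat

    def update(self, y_true, yhat):
        # Called once the actual arrives; intervals adapt to drift.
        self.buffer.append(y_true - yhat)
```

Because the buffer slides forward with the series, the interval width tracks the recent error distribution rather than a frozen calibration snapshot.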

Why This Matters for Foundation Models

Classical statistical models like ARIMA or ETS are typically fit to a specific series, and their uncertainty estimates, while imperfect, are at least calibrated to that particular data. Time series foundation models operate differently. They are pretrained on massive corpora spanning hundreds of domains, then applied zero-shot to entirely new series. There is no guarantee that the uncertainty learned during pretraining transfers correctly to your specific retail demand signal, sensor stream, or energy load curve.

This is where conformal prediction acts as a safety net. Regardless of how well or poorly the model's native intervals are calibrated on your data, the conformal wrapper corrects them using evidence from your own domain. The benchmarking challenges that make it hard to evaluate TSFMs across domains also make it hard to trust their raw uncertainty estimates. Conformal methods sidestep this problem entirely.

A Practical Recipe

Here is a concrete workflow for deploying conformally calibrated TSFM forecasts:

  1. Generate forecasts. Run your chosen model through the TSFM.ai forecast API to obtain point forecasts or quantile predictions. Any of the supported models work, including Chronos Bolt or Moirai.

  2. Set aside calibration data. Reserve a recent window of your time series as a calibration set. This should be data the model has not been fine-tuned on. The size matters: more calibration data yields tighter intervals, but even 50 to 100 points provide useful guarantees.

  3. Compute nonconformity scores. For each calibration point, compute the absolute residual between the forecast and the actual value. If you are working with quantile forecasts, compute the signed deviation from the interval bounds instead.

  4. Construct conformal intervals. Sort the nonconformity scores and select the quantile corresponding to your target coverage. Add this critical value to your production forecasts as a symmetric adjustment, or apply ACI for an adaptive version that tracks changing error patterns.

  5. Deploy and monitor. In production, track empirical coverage over rolling windows. If using ACI, the intervals will self-correct. If using static split conformal, recalibrate periodically as more data accumulates.
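Steps 2 through 5 can be sketched end to end. Here `forecasts` stands in for whatever point forecasts your model call returns, so the function name, arguments, and shapes are assumptions for illustration, not the TSFM.ai API:

```python
import numpy as np

def calibrate_and_monitor(y, forecasts, alpha=0.1, n_cal=100, window=50):
    """Calibrate on the first n_cal points, wrap the rest, and track
    rolling empirical coverage. y and forecasts are aligned 1D arrays."""
    # Steps 2-3: nonconformity scores on the calibration window.
    scores = np.abs(y[:n_cal] - forecasts[:n_cal])
    # Step 4: critical value with the finite-sample correction.
    k = int(np.ceil((n_cal + 1) * (1 - alpha)))
    qhat = np.sort(scores)[min(k, n_cal) - 1]
    lower = forecasts[n_cal:] - qhat
    upper = forecasts[n_cal:] + qhat
    # Step 5: rolling empirical coverage over the deployment period.
    covered = (y[n_cal:] >= lower) & (y[n_cal:] <= upper)
    rolling = np.convolve(covered, np.ones(window) / window, mode="valid")
    return lower, upper, rolling
```

If the rolling coverage drifts well below the nominal level, that is the signal to recalibrate, or to switch the static adjustment for ACI.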

Trade-Offs to Consider

Conformal prediction is not free. Tighter, more informative intervals demand more calibration data. With only 20 calibration points, a 90% conformal interval will be conservative, essentially padded to account for the small sample. With 500 calibration points, the intervals will be much closer to the true uncertainty. There is also a tension between adaptivity and stability: ACI intervals can fluctuate rapidly if the learning rate is set too aggressively, which may confuse downstream decision systems that expect smooth uncertainty bands.

The most important trade-off is philosophical. Conformal prediction guarantees marginal coverage, meaning the intervals are correct on average across all future time points. They do not guarantee conditional coverage at every individual step. If your application requires that the interval is correctly calibrated specifically during demand spikes or regime changes, you may need to combine conformal methods with domain-specific stratification.

Despite these caveats, conformal prediction remains one of the most practical tools available for turning a foundation model's raw output into intervals you can actually trust. It requires no model modification, works with any forecaster, and delivers finite-sample guarantees that no amount of architecture tuning can match. For teams deploying TSFMs in high-stakes environments, it should be a standard part of the postprocessing pipeline.
