Public Docs
OpenAPI Source of Truth
MCP Streamable HTTP
CLI for Consumers

TSFM.ai developer documentation.

Multiple pages, one contract. API, MCP, and CLI are aligned on the same schema so teams can move from manual calls to production automation with zero drift.

Learn

What are time series foundation models?

Foundation models have transformed NLP and computer vision. The same paradigm — pre-train once on massive diverse data, then deploy zero-shot — is now reshaping time series analysis.

The core idea

A time series foundation model (TSFM) is a large neural network pre-trained on billions of time series observations from diverse domains — retail sales, energy consumption, weather, traffic, finance, and more. By learning general temporal patterns (trends, seasonality, noise structures, regime changes) across this vast corpus, TSFMs can then forecast, detect anomalies, or classify new series they have never encountered before.

Think of it like GPT for time series. Just as a language model learns grammar, semantics, and reasoning from reading the internet, a TSFM learns the grammar of temporal data from observing millions of real-world signals. When you send it a new series, it already understands concepts like weekly cycles, holiday effects, trend changes, and noise — without being explicitly told about any of them.
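To make the zero-shot workflow concrete, here is a minimal sketch of packaging raw history into a forecast request. The field names (`inputs`, `target`, `horizon`, `quantiles`) are illustrative assumptions, not the actual TSFM.ai schema — consult the API reference for the real contract. The key point: no model is fit on your data; the pre-trained model receives the history as-is.

```python
import json

# Hypothetical request body for a zero-shot forecast. The endpoint and
# field names here are illustrative, not the real TSFM.ai schema.
def build_forecast_request(series, horizon=24, quantiles=(0.1, 0.5, 0.9)):
    """Package raw history into a forecast request.

    Nothing is trained here: the pre-trained TSFM receives the
    history as-is and generalizes zero-shot.
    """
    return {
        "inputs": [{"target": list(series)}],
        "horizon": horizon,
        "quantiles": list(quantiles),
    }

payload = build_forecast_request([112, 118, 132, 129, 121, 135], horizon=12)
print(json.dumps(payload, indent=2))
```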

Why this matters

No per-series training

Classical methods like ARIMA or Prophet require fitting a separate model for each series. With thousands of SKUs or sensors, that becomes an operational nightmare. TSFMs handle them all with a single model.
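The operational difference is visible in the request shape: instead of a fit-predict loop over N models, you batch N histories into one call against one model. The payload layout below is a sketch under assumed field names (`inputs`, `id`, `target`), not the documented schema.

```python
# One pre-trained model serves every series: rather than fitting
# thousands of ARIMA/Prophet models, batch all histories into a
# single request. Field names are illustrative assumptions.
def batch_request(series_by_id, horizon=28):
    return {
        "inputs": [
            {"id": sid, "target": hist}
            for sid, hist in series_by_id.items()
        ],
        "horizon": horizon,
    }

# 1,000 SKUs with short synthetic histories
skus = {f"sku-{i}": [10 + i, 12 + i, 11 + i, 13 + i] for i in range(1000)}
req = batch_request(skus)
print(len(req["inputs"]))  # 1000 series, one model, one call
```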

Works with limited data

New products, new sensors, and new markets often lack sufficient history for classical methods. TSFMs generalize from pre-training, producing reasonable forecasts from just a handful of observations.

Built-in uncertainty

Most TSFMs produce full probabilistic distributions, not just point forecasts. You get prediction intervals out of the box, which is critical for inventory planning, risk management, and capacity allocation.

Classical methods vs. foundation models

How TSFMs compare to traditional approaches like ARIMA, Prophet, ETS, and tree-based methods.

| Dimension | Classical | Foundation model |
| --- | --- | --- |
| Training | Fit one model per series using only that series' history | Pre-train once on billions of observations across thousands of diverse series |
| Generalization | Requires retraining or reconfiguration for new series | Zero-shot inference on unseen series without any training |
| Data requirements | Needs sufficient history per series (often 2+ seasonal cycles) | Works with as few as 10-20 observations due to pre-trained priors |
| Compute | Lightweight per model, but N models for N series | One model handles all series; GPU inference amortized across requests |
| Uncertainty | Varies — some produce intervals, many produce only point forecasts | Most produce full probabilistic distributions natively |
| Multi-task | Separate tools for forecasting, anomaly detection, imputation | Some models handle multiple tasks from a single checkpoint |

Key properties of TSFMs

Zero-shot generalization

The defining property. A TSFM trained on energy data can forecast retail demand without retraining. The model has internalized general time series patterns that transfer across domains.

Probabilistic outputs

Rather than a single forecast line, TSFMs generate distributions. You can extract any quantile — the 10th percentile for worst-case planning, the 90th for capacity budgeting, and the median for expected outcomes.
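When a probabilistic model returns sample paths rather than ready-made quantiles, you can extract any percentile per forecast step empirically. The sketch below assumes a sample-path output shape; it is a stdlib-only illustration, not a client library.

```python
# Extract per-step quantiles from sample forecast paths (e.g. returned
# by a probabilistic TSFM). Nearest-rank quantile, stdlib only.
def empirical_quantile(samples, q):
    """Return the nearest-rank q-quantile of a list of values."""
    s = sorted(samples)
    idx = min(len(s) - 1, int(q * len(s)))
    return s[idx]

# Five hypothetical sample paths over a 3-step horizon
paths = [
    [100, 102, 104],
    [ 98, 101, 103],
    [101, 103, 106],
    [ 99, 100, 102],
    [102, 105, 107],
]
steps = list(zip(*paths))  # transpose: all samples for each step
p10 = [empirical_quantile(s, 0.10) for s in steps]  # worst-case planning
p50 = [empirical_quantile(s, 0.50) for s in steps]  # expected outcome
p90 = [empirical_quantile(s, 0.90) for s in steps]  # capacity budgeting
```

In practice you would use a larger sample count and an interpolating quantile estimator, but the shape of the computation is the same.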

Multi-task capability

Some TSFMs (like MOMENT) can forecast, detect anomalies, classify series types, and impute missing values — all from a single pre-trained checkpoint. This simplifies the ML stack significantly.

Scale efficiency

One TSFM replaces hundreds or thousands of individual models. Inference is GPU-batched and amortized. The operational cost of managing, monitoring, and updating one model is far lower than managing one per series.

Terminology reference

Key concepts used throughout these docs and the TSFM.ai API.

Context window

The number of historical observations the model can see as input. Larger windows let the model capture longer seasonal patterns. Measured in tokens or data points.

Prediction horizon

How many future steps the model generates. Also called forecast length or prediction length. Typically 12-64 steps depending on the model and use case.

Quantile forecasting

Instead of a single point estimate, the model outputs values at specific probability levels (e.g., 10th, 50th, 90th percentile). This gives you prediction intervals that quantify uncertainty.

Zero-shot inference

Using a pre-trained model on new data it has never seen, without any fine-tuning. The model generalizes from patterns learned during pre-training.

Patching / tokenization

The process of converting raw time series values into discrete tokens that a transformer can process. Different models use different strategies — some quantize values into bins, others group consecutive points into patches.
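The patch-based strategy can be sketched in a few lines: consecutive values are grouped into fixed-length patches, each of which becomes one token for the transformer. The patch length of 4 below is an arbitrary illustrative choice; real models pick it as an architecture hyperparameter.

```python
# Patch-style tokenization sketch: group consecutive values into
# fixed-length patches, each treated as one transformer token.
def to_patches(series, patch_len):
    """Split a series into non-overlapping patches, dropping any remainder."""
    n = len(series) // patch_len
    return [series[i * patch_len:(i + 1) * patch_len] for i in range(n)]

series = list(range(10))
patches = to_patches(series, patch_len=4)
# A 10-point series yields 2 full patches of length 4
print(patches)
```

Patching shortens the effective sequence length the transformer must attend over, which is why larger patch sizes enable longer context windows at the same compute budget.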

Covariates

External variables that influence the target series. Past covariates are historical (e.g., past promotions), future covariates are known ahead of time (e.g., scheduled holidays).
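The distinction matters for payload shape: past covariates align with the history, while future covariates must also cover the forecast horizon. The field names below are hypothetical, not the actual TSFM.ai schema.

```python
# Illustrative payload separating past vs. future covariates.
# Field names are assumptions, not the documented TSFM.ai schema.
request = {
    "target": [120, 135, 128, 140, 150],      # 5 historical observations
    "past_covariates": {
        "promo": [0, 1, 0, 0, 1],             # same length as history
    },
    "future_covariates": {
        "holiday": [0, 0, 1, 0, 0, 0, 1, 1],  # history (5) + horizon (3)
    },
    "horizon": 3,
}

# Future covariates must extend over the full horizon
hist = len(request["target"])
assert len(request["future_covariates"]["holiday"]) == hist + request["horizon"]
```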

Probabilistic distribution

A complete description of the range of possible future values and their likelihoods. Goes beyond point forecasts to capture the full shape of uncertainty.

Fine-tuning

Adapting a pre-trained model to a specific dataset or domain by training for additional steps on that data. Trades compute for improved accuracy on the target distribution.

How TSFMs work

Dive into architecture patterns, training approaches, and how time series data is represented inside transformers.

Continue reading

Use cases

See how TSFMs are applied across industries — from demand planning to infrastructure monitoring.

Explore use cases