foundation-models · explainer · deep-learning

What Are Time Series Foundation Models?

An introduction to time series foundation models — what they are, how they work, and why they represent a paradigm shift in forecasting.

TSFM.ai Team
January 18, 2024 · 3 min read

For decades, time series forecasting followed a familiar playbook: collect domain-specific data, select an appropriate model family, train it, tune hyperparameters, and deploy. Whether you reached for ARIMA, Prophet, DeepAR, or N-BEATS, the workflow was fundamentally task-specific. Each new dataset required a new model. Each new domain demanded fresh expertise.

Time series foundation models (TSFMs) upend that paradigm entirely.

The Foundation Model Pattern

The idea behind TSFMs borrows directly from the revolution that large language models brought to natural language processing. GPT, BERT, and their successors demonstrated that a single model, pretrained on a massive and diverse text corpus, could generalize to tasks it was never explicitly trained on. Translation, summarization, question answering — all became accessible through a single pretrained artifact.

TSFMs apply the same logic to temporal data. Instead of pretraining on text, these models ingest enormous corpora of diverse time series spanning finance, energy, weather, retail, transportation, healthcare, and more. The hypothesis is that temporal patterns — trends, seasonality, level shifts, volatility clustering — share structural similarities across domains. A model that has seen enough variety can learn a general-purpose representation of how sequences evolve over time.

How TSFMs Work

Most modern TSFMs are built on transformer architectures, adapted for continuous-valued sequential data. The central engineering challenge is tokenization: unlike words in a vocabulary, time series values are continuous and unbounded. Different models handle this differently.

Patching is the most common strategy. Rather than feeding individual time steps into the transformer, the input series is divided into fixed-length or variable-length patches — contiguous subsequences that serve as the "tokens" of the model. This approach, popularized by PatchTST (Nie et al., 2023), reduces sequence length and lets the self-attention mechanism operate over semantically meaningful chunks rather than raw data points.
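The mechanics of patching are simple to sketch. The snippet below (a minimal illustration, not any specific model's implementation; the function name and parameters are our own) slides a fixed-length window over a series to produce the contiguous subsequences that act as the transformer's "tokens":

```python
def make_patches(series, patch_len, stride):
    # Slide a window of length patch_len over the series with the given
    # stride; each window becomes one "token" for the transformer. With
    # stride == patch_len the patches are non-overlapping.
    return [series[i:i + patch_len]
            for i in range(0, len(series) - patch_len + 1, stride)]

series = list(range(12))
patches = make_patches(series, patch_len=4, stride=4)
# → [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
```

Note the effect on sequence length: a 512-step series with non-overlapping patches of length 16 becomes just 32 tokens, which is what makes self-attention over long histories tractable.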

Discretization offers an alternative. Amazon's Chronos model bins continuous values into a fixed vocabulary of discrete tokens, then applies a standard language model architecture (T5) to the resulting token sequence. This lets the model use cross-entropy loss and categorical sampling, tools well understood from the NLP toolkit.
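A simplified sketch of this kind of value tokenization (our own illustration of the general scale-then-bin idea, not Chronos's exact procedure; the function name, bin range, and scaling rule are assumptions):

```python
def discretize(series, n_bins, low=-15.0, high=15.0):
    # Scale-then-bin tokenization sketch:
    # 1) rescale by the mean absolute value so disparate series share a range,
    # 2) clip to [low, high],
    # 3) map each scaled value to a uniform bin id in [0, n_bins).
    scale = sum(abs(v) for v in series) / len(series) or 1.0
    width = (high - low) / n_bins
    tokens = []
    for v in series:
        s = min(max(v / scale, low), high)
        tokens.append(min(int((s - low) / width), n_bins - 1))
    return tokens

tokens = discretize([1.0, -1.0, 2.0, -2.0], n_bins=10)
```

Once values are discrete token ids, forecasting becomes next-token prediction: the model outputs a categorical distribution over the vocabulary at each step, and sampling from it yields probabilistic forecasts for free.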

Other architectural choices vary by model. Some TSFMs use decoder-only transformers (following the GPT lineage), while others use encoder-decoder designs. Some produce point forecasts; others generate full probabilistic distributions through trajectory sampling.

The Current Landscape

The TSFM field has moved rapidly. Several notable models have emerged, each with a distinct approach:

  • Chronos (Amazon, 2024): Encoder-decoder T5 architecture with value tokenization. Produces probabilistic forecasts. Pretrained on a mix of public datasets and synthetic Gaussian process data. Released openly on Hugging Face in sizes ranging from 20M to 710M parameters. (paper)

  • TimesFM (Google, 2024): Decoder-only transformer pretrained on roughly 100 billion real-world time points sourced from Google Trends, Wikipedia pageviews, and synthetic augmentation. Uses input and output patching to handle variable-length horizons efficiently. (paper)

  • Moirai (Salesforce, 2024): A universal forecasting transformer designed to handle variable frequencies, prediction lengths, and numbers of variates in a single model. Uses mixture distributions for flexible probabilistic output. (paper)

  • Lag-Llama (2023): A decoder-only transformer that uses lagged features as covariates, drawing on the LLaMA architecture. Focuses on probabilistic forecasting and demonstrates strong transfer learning across domains. (paper)

  • MOMENT (CMU, 2024): Positions itself as a family of foundation models for general-purpose time series analysis, supporting forecasting, classification, anomaly detection, and imputation from a single pretrained backbone. (paper)

Zero-Shot Capability

The most practically significant property of TSFMs is zero-shot forecasting: the ability to generate predictions on entirely new time series without any fine-tuning or retraining. You pass in a context window of historical values, and the model returns a forecast.
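The calling pattern is the whole API surface: history in, forecast out, no fit step. The sketch below illustrates that interface shape with a deliberately trivial stand-in model (a last-value repeater); a real pretrained TSFM such as Chronos or TimesFM would return learned forecasts behind the same kind of call. All names here are illustrative, not any library's actual API:

```python
class ZeroShotForecaster:
    """Stand-in for a pretrained TSFM: same call shape, trivial logic.

    A real foundation model would load pretrained weights and produce
    learned forecasts; here we repeat the last observed value purely to
    show the interface — note there is no .fit() method anywhere.
    """

    def predict(self, context, prediction_length):
        # Zero-shot: the model consumes raw history and forecasts directly,
        # with no per-series training or fine-tuning step.
        return [context[-1]] * prediction_length

model = ZeroShotForecaster()
forecast = model.predict(context=[112, 118, 132, 129, 121],
                         prediction_length=3)
# → [121, 121, 121]
```

The absence of a fit step is the point: the same loaded model object can be called on a sales series, a sensor stream, and a traffic count without any per-series state.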

This eliminates the cold-start problem that has plagued production forecasting systems for years. There is no need to accumulate months of historical data before a model becomes useful. There is no need to maintain per-series model pipelines. A single pretrained model can serve thousands of distinct series across unrelated business domains.

For practitioners, this translates directly into reduced time-to-production, lower infrastructure costs, and fewer specialized ML engineering hours per forecasting use case.

What Comes Next

TSFMs are still in the early stages of maturation. Current limitations include constraints on context length, gaps in multivariate support, and open questions about when zero-shot performance is sufficient versus when domain-specific fine-tuning is necessary. Benchmark standards are still being established, and head-to-head comparisons across models remain an active area of research.

But the trajectory is clear. Just as NLP moved from task-specific word embeddings to general-purpose language models, time series analysis is converging on foundation models as the default starting point. The tooling, benchmarks, and deployment infrastructure around TSFMs will continue to mature — and platforms like TSFM.ai are building the ecosystem to make that transition practical for engineering teams today. Explore the available models on our model catalog.
