  • Public Docs
  • OpenAPI Source of Truth
  • MCP Streamable HTTP
  • CLI for Consumers

TSFM.ai developer documentation.

Multiple pages, one contract. API, MCP, and CLI are aligned on the same schema so teams can move from manual calls to production automation with zero drift.

Learn

How time series foundation models work

TSFMs use transformer architectures adapted for temporal data. The key challenge is converting continuous time series into a format that transformers can process, and then decoding the output back into useful forecasts.

Pre-training at scale

All TSFMs start with pre-training: exposing the model to massive collections of time series from diverse domains. Google trained TimesFM on 100B+ data points, Alibaba used 300B for Time-MoE, and Amazon used 30B+ with synthetic augmentation for Chronos. This diversity is critical — the model needs to see retail demand patterns, energy load curves, weather signals, financial prices, and web traffic to learn generalizable temporal features.

The training objective varies by architecture, but the goal is the same: learn representations that capture universal time series properties — trends, seasonality, noise levels, regime changes, and correlations — that transfer to new, unseen series at inference time.

Training objectives

How different models learn from raw time series data.

Token prediction (next-token)

Quantize values into bins and train the model to predict the next token. Same objective as language models. Used by Chronos.
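A minimal sketch of this objective, assuming mean scaling and uniform bins (the real Chronos tokenizer differs in its scaling and bin layout):

```python
import numpy as np

def tokenize(series, n_bins=64, low=-5.0, high=5.0):
    """Mean-scale a series and quantize into uniform bins (a simplified
    sketch of Chronos-style tokenization, not the actual tokenizer)."""
    scale = np.mean(np.abs(series)) or 1.0
    scaled = series / scale
    edges = np.linspace(low, high, n_bins - 1)   # 63 bin boundaries
    tokens = np.digitize(scaled, edges)          # one token id per time step
    return tokens, scale

def detokenize(tokens, scale, n_bins=64, low=-5.0, high=5.0):
    """Map token ids back to bin-center values, then undo the scaling."""
    centers = np.linspace(low, high, n_bins)
    return centers[tokens] * scale

series = np.array([10.0, 12.0, 9.0, 11.0, 13.0])
tokens, scale = tokenize(series)
# Next-token objective: predict tokens[1:] from tokens[:-1],
# exactly like language-model training on text.
context, targets_seq = tokens[:-1], tokens[1:]
```

Once values are tokens, training reduces to standard cross-entropy next-token prediction, and forecasting is sampling tokens and mapping them back through `detokenize`.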

Patch reconstruction

Mask random patches of the input series and train the model to reconstruct them. Similar to BERT's masked language modeling. Used by MOMENT.
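The masking setup can be sketched as follows (patch size and mask ratio here are illustrative, not MOMENT's actual values):

```python
import numpy as np

rng = np.random.default_rng(0)

series = rng.standard_normal(512)
patch_len, mask_ratio = 8, 0.3

patches = series.reshape(-1, patch_len)            # (64 patches, 8 steps)
n_masked = int(len(patches) * mask_ratio)
masked_idx = rng.choice(len(patches), n_masked, replace=False)

inputs = patches.copy()
inputs[masked_idx] = 0.0                           # hide the masked patches
targets = patches[masked_idx]                      # model must reconstruct these

# Training loss: MSE between the encoder's output at the masked positions
# and `targets`; the encoder itself is omitted in this sketch.
```

Because the model must infer hidden patches from visible ones on both sides, it learns bidirectional temporal structure, which is why this objective transfers well to imputation and anomaly detection as well as forecasting.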

Direct value regression

Train the model to directly predict continuous future values given context. Uses MSE or distribution-based loss. Used by TimesFM, Moirai.
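As a sketch, with a linear map standing in for the transformer (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
context_len, horizon = 32, 8
# Random weights stand in for the learned network; the point is the
# objective, not the architecture.
W = rng.standard_normal((context_len, horizon)) * 0.01

def forecast(context):
    return context @ W                      # (horizon,) continuous values

def mse_loss(context, future):
    pred = forecast(context)
    return np.mean((pred - future) ** 2)    # the regression training objective
```

No quantization step means no discretization error, but the model must handle the full dynamic range of raw values, which is why these models normalize each context window before prediction.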

Denoising / flow-matching

Add noise to real future values and train the model to remove it. At inference, start from pure noise and denoise into a forecast. Used by Sundial.
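The training target can be sketched in generic flow-matching notation (Sundial's exact formulation may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

x1 = rng.standard_normal(16)        # real future values (one window)
x0 = rng.standard_normal(16)        # pure noise sample
t = rng.uniform()                   # random interpolation time in [0, 1]

x_t = (1 - t) * x0 + t * x1         # noisy point on the path noise -> data
velocity_target = x1 - x0           # the model learns to predict this
                                    # from (x_t, t) and the input context
```

Training minimizes the MSE between the model's predicted velocity and `velocity_target`; inference then integrates the learned velocity field from noise toward a forecast.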

Distribution head

Train the model to output parameters of a probability distribution (e.g., Student-t, mixture of Gaussians). Used by Moirai, Lag-Llama.
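A sketch of the objective with a Gaussian head for simplicity (Moirai and Lag-Llama use richer families such as Student-t and mixtures; all parameter values below are made up):

```python
import numpy as np

# The network would emit these per future step; here they are fixed.
mu = np.array([10.0, 11.0, 10.5])        # predicted means
sigma = np.array([2.0, 2.0, 2.5])        # predicted scales (kept positive)
observed = np.array([9.5, 11.0, 10.2])   # actual future values

# Negative log-likelihood of the observations under the predicted Gaussians.
log_pdf = (-0.5 * ((observed - mu) / sigma) ** 2
           - np.log(sigma) - 0.5 * np.log(2 * np.pi))
nll = -log_pdf.sum()                     # minimize this during training

# Uncertainty at inference comes directly from the parameters:
p90 = mu + 1.2816 * sigma                # ~90th percentile per step
```

The payoff is that quantiles and prediction intervals are closed-form reads of the head's outputs, with no sampling loop required.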

Architecture patterns

Each TSFM family takes a different approach to processing time series data. Here are the major patterns available on TSFM.ai.

Encoder-Decoder (T5-style)

Used by: Chronos, Chronos-2

Quantizes continuous time series values into discrete token bins and processes them through a T5-style encoder-decoder transformer. The encoder reads the full context window; the decoder autoregressively generates future tokens which are mapped back to continuous values. Naturally produces probabilistic outputs by sampling multiple trajectories.

Strengths

  • Strong probabilistic calibration
  • Native covariate support (v2)
  • Well-studied architecture

Tradeoffs

  • Autoregressive decoding adds latency per step
  • Quantization introduces discretization error

Decoder-Only (Patched Input)

Used by: TimesFM, Lag-Llama, Toto

Groups consecutive time points into patches (similar to Vision Transformer patches) and feeds them as tokens to a decoder-only transformer. The model autoregressively predicts the next patch of values. Patching reduces sequence length and computational cost while preserving local structure.
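The patching step itself is a reshape plus a learned projection; a sketch with a random matrix standing in for the learned input embedding:

```python
import numpy as np

series = np.arange(96, dtype=float)
patch_len, d_model = 16, 64

patches = series.reshape(-1, patch_len)          # (6, 16): 6 tokens, not 96
rng = np.random.default_rng(0)
W_embed = rng.standard_normal((patch_len, d_model)) * 0.1
tokens = patches @ W_embed                       # (6, 64) transformer input

# Sequence length drops from 96 to 6, so attention cost falls by ~256x
# (quadratic in sequence length) while each token keeps local structure.
```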

Strengths

  • Efficient long-context handling via patching
  • Leverages standard LLM infrastructure
  • Good at capturing local patterns

Tradeoffs

  • Patch boundaries can miss fine-grained transitions
  • Autoregressive generation for long horizons

Direct Multi-Step Prediction

Used by: Chronos-Bolt

Instead of generating one token at a time, predicts the entire forecast horizon in a single forward pass. The model maps the input context directly to all future steps simultaneously. This eliminates the autoregressive bottleneck and achieves 3-5x faster inference.
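A sketch of the one-shot head, with random weights standing in for the learned decoder (the point is the single forward pass, not the architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
context_len, horizon = 64, 24

W_head = rng.standard_normal((context_len, horizon)) * 0.01
context = rng.standard_normal(context_len)

forecast = context @ W_head                  # all 24 steps in one pass

# Autoregressive decoding would need `horizon` sequential passes, each
# feeding its own prediction back in; here there is nothing to feed back,
# so latency is one pass and per-step errors cannot compound.
```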

Strengths

  • Dramatically lower latency
  • No error accumulation across steps
  • Ideal for real-time applications

Tradeoffs

  • Fixed horizon length at inference
  • May sacrifice some accuracy on very long horizons

Masked Encoder (Any-Variate)

Used by: Moirai (Small, Base, Large)

Uses a masked encoder architecture with Any-Variate Attention — a mechanism that handles arbitrary numbers of input variates without assuming channel independence. Each variate is treated as a separate token sequence, and cross-variate attention captures dependencies between them.
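The token layout can be sketched as follows (a simplification of Any-Variate Attention; the real mechanism also uses rotary time embeddings and binary variate encodings):

```python
import numpy as np

n_variates, n_steps = 3, 8
rng = np.random.default_rng(0)
grid = rng.standard_normal((n_variates, n_steps))

# Flatten the (variate, time) grid into one token sequence and tag each
# token with its variate id and time id, so a single encoder can attend
# across both axes for any number of variates.
values = grid.reshape(-1)                              # (24,) flattened tokens
variate_id = np.repeat(np.arange(n_variates), n_steps)
time_id = np.tile(np.arange(n_steps), n_variates)

# Full self-attention over these 24 tokens lets, e.g., variate 0 at step 7
# attend to variate 2 at step 3, modeling cross-variate dependencies
# directly; the cost grows quadratically in n_variates * n_steps.
```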

Strengths

  • True multivariate modeling
  • Flexible variate count at inference
  • Strong on correlated series

Tradeoffs

  • Quadratic attention cost with many variates
  • Requires more GPU memory for large multivariate sets

Mixture of Experts (MoE)

Used by: Time-MoE

Routes inputs through a sparse set of expert subnetworks. Only a fraction of the total parameters are active for any given input, selected by a learned routing mechanism. Different experts specialize in different temporal patterns or domains, enabling strong transfer without proportional compute cost.
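A top-k routing sketch, with random matrices standing in for the expert subnetworks:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 16, 8, 2
experts = rng.standard_normal((n_experts, d, d)) * 0.1
W_gate = rng.standard_normal((d, n_experts)) * 0.1

def moe_forward(x):
    logits = x @ W_gate                           # gate scores every expert
    top = np.argsort(logits)[-k:]                 # indices of the k best
    weights = np.exp(logits[top])
    weights /= weights.sum()                      # softmax over selected experts
    # Only k of n_experts matrices are touched: active compute is ~k/n of total.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

out = moe_forward(rng.standard_normal(d))
```

In training, an auxiliary load-balancing loss keeps the gate from collapsing onto a few favorite experts, which is the "careful training" tradeoff noted below.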

Strengths

  • Large total capacity with modest compute
  • Domain specialization via expert routing
  • Strong cross-domain transfer

Tradeoffs

  • Load balancing across experts requires careful training
  • Total parameter count is much larger than active count

MLP-Mixer

Used by: Granite TTM

Replaces attention with MLP-based mixing layers that alternate between mixing across time steps and mixing across features. Extremely parameter-efficient — achieves competitive accuracy with roughly 1M parameters compared to hundreds of millions for transformer-based models.
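One mixer block can be sketched with two dense matrices (random stand-ins for the learned weights; real blocks add norms, nonlinearities, and residual connections):

```python
import numpy as np

rng = np.random.default_rng(0)
n_steps, n_features = 32, 4
x = rng.standard_normal((n_steps, n_features))

W_time = rng.standard_normal((n_steps, n_steps)) * 0.1
W_feat = rng.standard_normal((n_features, n_features)) * 0.1

h = W_time @ x          # time mixing: every step sees every other step
y = h @ W_feat          # feature mixing: channels interact within each step

# Parameter count here is n_steps^2 + n_features^2 = 1040, and everything
# is a plain dense matmul, which is why TTM-style models run well on CPU.
# The fixed shape of W_time is also the limited-context tradeoff: the
# context length is baked into the weights.
```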

Strengths

  • Ultra-low latency (<100ms)
  • CPU-friendly inference
  • Tiny model size

Tradeoffs

  • Limited context length
  • Less expressive than attention for complex patterns

Diffusion / Flow-Matching

Used by: Sundial

Generates forecasts by iteratively denoising from random noise into a plausible future trajectory. Flow-matching variants use a continuous-time framework that requires fewer denoising steps than standard diffusion. Produces full predictive distributions by generating multiple samples.
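The sampling loop can be sketched with Euler integration; the "velocity model" below just pulls toward a fixed target so the loop is runnable, whereas a real model predicts velocity from the current state, the time, and the input context:

```python
import numpy as np

rng = np.random.default_rng(0)

target = np.sin(np.linspace(0, 3, 16))       # stand-in "plausible future"

def velocity(x, t):
    return target - x                        # toy field: flows noise -> target

x = rng.standard_normal(16)                  # start from pure noise
n_steps = 8                                  # few steps, per flow-matching
for i in range(n_steps):
    t = i / n_steps
    x = x + velocity(x, t) / n_steps         # one Euler integration step

# Drawing many such samples (different starting noise) yields an empirical
# predictive distribution; quantiles across samples give uncertainty bands.
```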

Strengths

  • Rich distributional outputs
  • Well-calibrated uncertainty
  • Captures multi-modal futures

Tradeoffs

  • Multiple forward passes per sample
  • Higher inference cost than single-pass models

LLM Reprogramming

Used by: Time-LLM

Takes a pre-trained large language model (e.g., LLaMA-7B) and reprograms it for time series by converting series data into text-like token sequences via Prompt-as-Prefix. The LLM's pre-trained reasoning capabilities are repurposed for temporal pattern recognition without retraining most parameters.
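The prefix part can be sketched as text that encodes the task and dataset statistics (the wording is illustrative, not Time-LLM's actual template; in the real model the series itself enters as reprogrammed patch embeddings rather than the plain text shown here):

```python
values = [112.0, 118.0, 132.0, 129.0, 121.0, 135.0]

# Hypothetical prompt construction: describe the domain, summary statistics,
# and task in natural language, then attach the series representation.
prefix = (
    "Dataset: monthly airline passengers. "
    f"The input has {len(values)} steps; min {min(values)}, max {max(values)}, "
    "overall trend upward. Predict the next 3 steps."
)
series_text = " ".join(str(v) for v in values)
prompt = prefix + " Series: " + series_text
```

The prefix gives the frozen LLM task context it can reason over, so only the small input/output adapters need training.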

Strengths

  • Leverages LLM reasoning capabilities
  • Large pre-trained knowledge base
  • Unique approach to transfer learning

Tradeoffs

  • Very high inference cost (7B parameters)
  • Latency makes it impractical for real-time use
  • Resource-intensive deployment

Zero-shot vs. fine-tuning

Most TSFMs are designed for zero-shot use — send your data, get a forecast, no training required. This works well when your data resembles patterns the model has seen during pre-training (which covers most common domains). Fine-tuning becomes valuable when you have a large, domain-specific dataset and need maximum accuracy — for example, forecasting a proprietary financial signal that has unique statistical properties.

Start with zero-shot

No training needed. Works immediately. Iterate on which model and parameters give the best results for your data before investing in fine-tuning.

Fine-tune when needed

If zero-shot accuracy plateaus and you have 10K+ observations, fine-tuning can squeeze out additional performance on your specific distribution.

Use cases

See how these architectures are applied to real-world forecasting, anomaly detection, and classification problems.

Continue reading

Choosing a model

Use the architecture and capability tradeoffs to select the right model for your specific requirements.

Selection guide