TSFM.ai developer documentation.
Multiple pages, one contract. API, MCP, and CLI are aligned on the same schema so teams can move from manual calls to production automation with zero drift.
Learn
How time series foundation models work
TSFMs use transformer architectures adapted for temporal data. The key challenge is converting continuous time series into a format that transformers can process, and then decoding the output back into useful forecasts.
Pre-training at scale
All TSFMs start with pre-training: exposing the model to massive collections of time series from diverse domains. Google trained TimesFM on 100B+ data points, Alibaba used 300B for Time-MoE, and Amazon used 30B+ with synthetic augmentation for Chronos. This diversity is critical — the model needs to see retail demand patterns, energy load curves, weather signals, financial prices, and web traffic to learn generalizable temporal features.
The training objective varies by architecture, but the goal is the same: learn representations that capture universal time series properties — trends, seasonality, noise levels, regime changes, and correlations — that transfer to new, unseen series at inference time.
Training objectives
How different models learn from raw time series data.
Token prediction (next-token)
Quantize values into bins and train the model to predict the next token. Same objective as language models. Used by Chronos.
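The binning step can be sketched in a few lines. This is an illustrative uniform-binning tokenizer, not Chronos's actual tokenizer settings (bin count, range, and scaling are made up here):

```python
# Sketch of value quantization for next-token training. Bin count and value
# range are illustrative, not the model's real tokenizer configuration.

def quantize(series, n_bins=10, lo=-3.0, hi=3.0):
    """Map (pre-scaled) values to integer token ids via uniform binning."""
    width = (hi - lo) / n_bins
    tokens = []
    for x in series:
        clipped = min(max(x, lo), hi - 1e-9)   # keep values inside the bin range
        tokens.append(int((clipped - lo) / width))
    return tokens

def dequantize(tokens, n_bins=10, lo=-3.0, hi=3.0):
    """Map token ids back to bin-center values."""
    width = (hi - lo) / n_bins
    return [lo + (t + 0.5) * width for t in tokens]

series = [0.1, 0.4, 0.9, 1.3]
tokens = quantize(series)                      # e.g. [5, 5, 6, 7]
# Next-token training pairs: predict tokens[t+1] given tokens up to t.
pairs = list(zip(tokens[:-1], tokens[1:]))
```

Quantization is what lets a language-model objective apply unchanged: after this step, a series is just a token sequence.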
Patch reconstruction
Mask random patches of the input series and train the model to reconstruct them. Similar to BERT's masked language modeling. Used by MOMENT.
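The masking setup can be sketched as follows. Patch length and mask ratio here are illustrative, not MOMENT's actual hyperparameters:

```python
import random

# Sketch of masked-patch training targets: hide random patches; the model is
# trained to reconstruct exactly those spans.

def mask_patches(series, patch_len=4, mask_ratio=0.5, seed=0):
    """Split a series into patches and zero out a random subset.

    Returns (masked_series, mask) where mask[i] is True for patches the
    model must reconstruct.
    """
    rng = random.Random(seed)
    n_patches = len(series) // patch_len
    mask = [rng.random() < mask_ratio for _ in range(n_patches)]
    masked = list(series)
    for i, hidden in enumerate(mask):
        if hidden:
            for j in range(i * patch_len, (i + 1) * patch_len):
                masked[j] = 0.0   # placeholder value for masked positions
    return masked, mask

series = list(range(16))            # 4 patches of length 4
masked, mask = mask_patches(series)
# The reconstruction loss is computed only on the masked patches.
```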
Direct value regression
Train the model to directly predict continuous future values given context. Uses MSE or distribution-based loss. Used by TimesFM, Moirai.
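In its simplest MSE form, the objective is just a squared-error comparison between the predicted continuation and the true one (all values below are toy data):

```python
# Minimal sketch of the direct-regression objective with an MSE loss.

def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

context = [1.0, 2.0, 3.0, 4.0]   # model input
target  = [5.0, 6.0]             # true future values
pred    = [4.8, 6.1]             # hypothetical model output
loss = mse(pred, target)
```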
Denoising / flow-matching
Add noise to real future values and train the model to remove it. At inference, start from pure noise and denoise into a forecast. Used by Sundial.
Distribution head
Train the model to output parameters of a probability distribution (e.g., Student-t, mixture of Gaussians). Used by Moirai, Lag-Llama.
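The distribution-head objective can be sketched with a Gaussian head for brevity; the models above use richer families (Student-t, mixtures), but the training signal is the same: negative log-likelihood of the true value under the predicted distribution.

```python
import math

# Sketch of a distribution-head loss, simplified to a single Gaussian.

def gaussian_nll(mu, sigma, y):
    """Negative log-likelihood of y under N(mu, sigma^2)."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)

# The head outputs (mu, sigma) per future step. The loss rewards both an
# accurate mean and a well-sized uncertainty estimate: being confidently
# wrong is penalized far more than being honestly uncertain.
overconfident = gaussian_nll(mu=5.0, sigma=0.1, y=6.0)
calibrated    = gaussian_nll(mu=5.0, sigma=1.0, y=6.0)
```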
Architecture patterns
Each TSFM family takes a different approach to processing time series data. Here are the major patterns available on TSFM.ai.
Encoder-Decoder (T5-style)
Used by: Chronos, Chronos-2
Quantizes continuous time series values into discrete token bins and processes them through a T5-style encoder-decoder transformer. The encoder reads the full context window; the decoder autoregressively generates future tokens, which are mapped back to continuous values. Naturally produces probabilistic outputs by sampling multiple trajectories.
Strengths
- Strong probabilistic calibration
- Native covariate support (v2)
- Well-studied architecture
Tradeoffs
- Autoregressive decoding adds latency per step
- Quantization introduces discretization error
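The trajectory-sampling step that makes this architecture probabilistic can be sketched as follows. The decoder is stubbed out with a toy random walk; in the real model each sample path comes from autoregressive decoding:

```python
import random

# Sketch of turning sampled trajectories into forecast bands.
# `sample_trajectory` stands in for the real decoder.

def sample_trajectory(context, horizon, rng):
    """Placeholder for autoregressive decoding of one sample path."""
    path, last = [], context[-1]
    for _ in range(horizon):
        last += rng.gauss(0, 0.5)
        path.append(last)
    return path

def quantile(xs, q):
    xs = sorted(xs)
    return xs[int(q * (len(xs) - 1))]

rng = random.Random(0)
context = [1.0, 1.2, 1.1]
samples = [sample_trajectory(context, horizon=4, rng=rng) for _ in range(100)]
# Empirical forecast bands at each future step:
p10 = [quantile([s[t] for s in samples], 0.1) for t in range(4)]
p50 = [quantile([s[t] for s in samples], 0.5) for t in range(4)]
p90 = [quantile([s[t] for s in samples], 0.9) for t in range(4)]
```

More samples sharpen the quantile estimates, which is also why this approach pays extra latency for its calibration.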
Decoder-Only (Patched Input)
Used by: TimesFM, Lag-Llama, Toto
Groups consecutive time points into patches (similar to Vision Transformer patches) and feeds them as tokens to a decoder-only transformer. The model autoregressively predicts the next patch of values. Patching reduces sequence length and computational cost while preserving local structure.
Strengths
- Efficient long-context handling via patching
- Leverages standard LLM infrastructure
- Good at capturing local patterns
Tradeoffs
- Patch boundaries can miss fine-grained transitions
- Autoregressive generation still adds per-step latency on long horizons
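The patching step itself is simple. The patch length below is illustrative; each model configures its own:

```python
# Sketch of input patching: group consecutive time points into fixed-size
# patches that become transformer tokens.

def patchify(series, patch_len):
    """Split a series into non-overlapping patches, dropping any remainder."""
    n = len(series) // patch_len
    return [series[i * patch_len:(i + 1) * patch_len] for i in range(n)]

series = list(range(1, 13))              # 12 time points
patches = patchify(series, patch_len=4)  # 3 patch tokens
# Sequence length drops 4x, so quadratic attention cost drops ~16x here,
# while each patch still preserves local structure.
```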
Direct Multi-Step Prediction
Used by: Chronos-Bolt
Instead of generating one token at a time, predicts the entire forecast horizon in a single forward pass. The model maps the input context directly to all future steps simultaneously. This eliminates the autoregressive bottleneck, making inference roughly 3-5x faster.
Strengths
- Dramatically lower latency
- No error accumulation across steps
- Ideal for real-time applications
Tradeoffs
- Fixed horizon length at inference
- May sacrifice some accuracy on very long horizons
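The single-pass idea can be sketched with a fixed linear map standing in for the learned network (all weights below are made-up toy values):

```python
# Sketch of direct multi-step prediction: one forward pass maps the whole
# context to the whole horizon, with no per-step decoding loop.

def predict_horizon(context, weights, horizon):
    """Produce all `horizon` steps at once: out[h] = sum_i w[h][i] * context[i]."""
    return [sum(w * x for w, x in zip(weights[h], context)) for h in range(horizon)]

context = [1.0, 2.0, 3.0, 4.0]
# One weight row per future step (toy values):
weights = [
    [0.0, 0.0, 0.0, 1.0],    # step 1: copy the last value
    [0.0, 0.0, -1.0, 2.0],   # step 2: extrapolate the last slope
]
forecast = predict_horizon(context, weights, horizon=2)
# Latency is one pass regardless of horizon -- but the horizon length is
# baked into the output head, hence the fixed-horizon tradeoff.
```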
Masked Encoder (Any-Variate)
Used by: Moirai (Small, Base, Large)
Uses a masked encoder architecture with Any-Variate Attention — a mechanism that handles arbitrary numbers of input variates without assuming channel independence. Each variate is treated as a separate token sequence, and cross-variate attention captures dependencies between them.
Strengths
- True multivariate modeling
- Flexible variate count at inference
- Strong on correlated series
Tradeoffs
- Quadratic attention cost with many variates
- Requires more GPU memory for large multivariate sets
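The token layout behind Any-Variate Attention can be sketched as follows; the attention computation itself is omitted, and the tagging scheme below is a simplified illustration:

```python
# Sketch of the any-variate token layout: every variate contributes its own
# sequence, and each token is tagged with (variate_id, time_index) so
# attention can span both axes.

def flatten_variates(variates):
    """Turn a dict of {name: series} into one token list for the encoder."""
    tokens = []
    for var_id, (name, series) in enumerate(sorted(variates.items())):
        for t, value in enumerate(series):
            tokens.append({"variate": var_id, "time": t, "value": value})
    return tokens

data = {
    "sales": [10.0, 12.0, 11.0],
    "temp":  [21.5, 22.0, 20.8],
}
tokens = flatten_variates(data)
# 2 variates x 3 steps = 6 tokens; a third variate would simply add 3 more
# with no architectural change. That flexibility is also why attention cost
# grows quadratically in (variates x time).
```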
Mixture of Experts (MoE)
Used by: Time-MoE
Routes inputs through a sparse set of expert subnetworks. Only a fraction of the total parameters are active for any given input, selected by a learned routing mechanism. Different experts specialize in different temporal patterns or domains, enabling strong transfer without proportional compute cost.
Strengths
- Large total capacity with modest compute
- Domain specialization via expert routing
- Strong cross-domain transfer
Tradeoffs
- Load balancing across experts requires careful training
- Total parameter count is much larger than active count
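The routing mechanism can be sketched with top-k gating. Expert count, k, and the gate logits below are toy values, not Time-MoE's actual configuration:

```python
import math

# Sketch of sparse top-k expert routing: softmax the gate logits, keep the
# top-k experts, renormalize their weights, and run only those experts.

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(gate_logits, k=2):
    """Pick the top-k experts and renormalize their gate weights."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return [(expert, probs[expert] / total) for expert in top]

# Router output for one input token over 4 experts:
active = route([2.0, -1.0, 0.5, 1.5], k=2)
# Only 2 of 4 experts run, so active compute stays low while total capacity
# is large -- and why balancing load across experts matters during training.
```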
MLP-Mixer
Used by: Granite TTM
Replaces attention with MLP-based mixing layers that alternate between mixing across time steps and mixing across features. Extremely parameter-efficient — achieves competitive accuracy with roughly 1M parameters compared to hundreds of millions for transformer-based models.
Strengths
- Ultra-low latency (<100ms)
- CPU-friendly inference
- Tiny model size
Tradeoffs
- Limited context length
- Less expressive than attention for complex patterns
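The alternating-mixing idea can be sketched on a tiny (time x feature) grid. The real layers are learned MLPs; here each "MLP" is just a mean, purely for illustration:

```python
# Sketch of MLP-Mixer-style mixing: one step mixes along the time axis,
# the next along the feature axis -- no attention anywhere.

def transpose(grid):
    return [list(col) for col in zip(*grid)]

def mix_rows(grid):
    """Stand-in for an MLP applied independently to each row."""
    return [[sum(row) / len(row)] * len(row) for row in grid]

x = [[1.0, 3.0],   # time step 0: two features
     [5.0, 7.0]]   # time step 1
time_mixed = transpose(mix_rows(transpose(x)))   # mix information across time
feat_mixed = mix_rows(time_mixed)                # then across features
# Per-axis MLPs are cheap and parallel, which is why this family runs
# comfortably on CPU with tiny parameter counts.
```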
Diffusion / Flow-Matching
Used by: Sundial
Generates forecasts by iteratively denoising from random noise into a plausible future trajectory. Flow-matching variants use a continuous-time framework that requires fewer denoising steps than standard diffusion. Produces full predictive distributions by generating multiple samples.
Strengths
- Rich distributional outputs
- Well-calibrated uncertainty
- Captures multi-modal futures
Tradeoffs
- Multiple forward passes per sample
- Higher inference cost than single-pass models
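The denoise-from-noise loop can be sketched with Euler integration of a velocity field. The real velocity field is learned by the model; the stand-in below simply pulls the sample toward a fixed toy target:

```python
import random

# Sketch of flow-matching-style sampling: start from pure noise and
# integrate a velocity field toward a plausible forecast.

def velocity(x, target):
    """Stand-in for the learned model: points from x toward the data."""
    return [t - xi for xi, t in zip(x, target)]

def sample_forecast(target, steps=10, seed=0):
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in target]   # start from pure noise
    dt = 1.0 / steps
    for _ in range(steps):                  # a few Euler integration steps
        v = velocity(x, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

toy_target = [5.0, 5.5, 6.0]
forecast = sample_forecast(toy_target)
# Each new seed yields another sample path, building a full predictive
# distribution -- at the cost of `steps` forward passes per sample.
```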
LLM Reprogramming
Used by: Time-LLM
Takes a pre-trained large language model (e.g., LLaMA-7B) and reprograms it for time series by converting series data into text-like token sequences via Prompt-as-Prefix. The LLM's pre-trained reasoning capabilities are repurposed for temporal pattern recognition while most of its parameters stay frozen.
Strengths
- Leverages LLM reasoning capabilities
- Large pre-trained knowledge base
- Unique approach to transfer learning
Tradeoffs
- Very high inference cost (7B parameters)
- Latency makes it impractical for real-time use
- Resource-intensive deployment
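The Prompt-as-Prefix idea can be sketched as assembling a natural-language prefix that describes the task and basic data statistics. The wording below is illustrative only, not Time-LLM's actual prompt template:

```python
# Sketch of Prompt-as-Prefix: a text description of the task and data is
# prepended before the reprogrammed series embeddings.

def build_prompt(series, horizon, domain="retail sales"):
    lo, hi = min(series), max(series)
    trend = "rising" if series[-1] > series[0] else "falling or flat"
    return (
        f"Dataset: {domain}. "
        f"Input has {len(series)} steps in [{lo}, {hi}], overall {trend}. "
        f"Forecast the next {horizon} steps."
    )

prompt = build_prompt([10.0, 11.0, 13.0, 12.0, 14.0], horizon=3)
# The frozen LLM sees this prefix plus the series representation, steering
# its pretrained reasoning toward the forecasting task.
```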
Zero-shot vs. fine-tuning
Most TSFMs are designed for zero-shot use — send your data, get a forecast, no training required. This works well when your data resembles patterns the model has seen during pre-training (which covers most common domains). Fine-tuning becomes valuable when you have a large, domain-specific dataset and need maximum accuracy — for example, forecasting a proprietary financial signal that has unique statistical properties.
Start with zero-shot
No training needed. Works immediately. Iterate on which model and parameters give the best results for your data before investing in fine-tuning.
Fine-tune when needed
If zero-shot accuracy plateaus and you have 10K+ observations, fine-tuning can squeeze out additional performance on your specific distribution.
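A zero-shot call is just a request body with your series, a model choice, and a horizon. The field names and model identifier below are hypothetical placeholders for illustration; consult the TSFM.ai API reference for the actual schema:

```python
import json

# Hypothetical sketch of a zero-shot forecast request body. Every field
# name here is an assumption, not the documented schema.

def build_request(model, values, horizon, quantiles=(0.1, 0.5, 0.9)):
    return {
        "model": model,
        "series": {"values": values},
        "horizon": horizon,
        "quantiles": list(quantiles),
    }

body = build_request("chronos-bolt", [112, 118, 132, 129, 141], horizon=12)
payload = json.dumps(body)   # ready to send to the forecast endpoint
```

Iterating in zero-shot mode means only the `model` field changes between experiments, which is what makes model comparison cheap before any fine-tuning.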
Use cases
See how these architectures are applied to real-world forecasting, anomaly detection, and classification problems.
Continue reading
Choosing a model
Use the architecture and capability tradeoffs to select the right model for your specific requirements.
Selection guide