Tags: lag-llama, open-source, model-architecture, probabilistic-forecasting

Lag-Llama: The Open-Source Time Series Foundation Model

Lag-Llama brings the decoder-only LLM architecture to time series with lag-based tokenization and distributional outputs.

TSFM.ai Team
October 10, 2024 · 5 min read

The explosion of foundation models in time series has produced a spectrum of architectural approaches: patching and binning in Chronos, masked encoding in Moirai, residual patching in TimesFM. Lag-Llama, introduced by Rasul et al. in late 2023, takes a different path. It applies the decoder-only transformer architecture, directly inspired by Meta's LLaMA, to time series forecasting using lag features as input tokens. The result is a compact, probabilistic model that punches well above its parameter count.

Architecture: Decoder-Only with Lag Tokenization

Most time series foundation models face the same fundamental question: how do you turn a continuous signal into a sequence of tokens that a transformer can process? Chronos bins values into discrete categories. TimesFM and Moirai use patching to group contiguous timesteps. Lag-Llama takes a third approach: it constructs each token from the lag features at a given timestep.

At each time step t, the model input is not simply the value x(t) but a vector of lagged values: x(t-1), x(t-2), x(t-3), ..., x(t-7), x(t-14), x(t-28), x(t-364), and so on. The lag indices are deliberately chosen to capture common seasonal patterns. For daily data, lags at 7, 14, and 28 capture weekly and monthly cycles; the lag at 364 captures annual seasonality. For hourly data, a different set of lag indices would emphasize 24-hour and 168-hour (weekly) cycles.
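The lag construction can be sketched as follows. Note that `lag_features` is a hypothetical helper and `DAILY_LAGS` is an illustrative lag set following the pattern described above, not the exact frequency-dependent indices the released model uses:

```python
import numpy as np

def lag_features(series: np.ndarray, t: int, lags: list[int]) -> np.ndarray:
    """Build the input token for timestep t: a vector of lagged values,
    zero-padded where a lag reaches before the start of the series."""
    return np.array([series[t - l] if t - l >= 0 else 0.0 for l in lags])

# Illustrative lag set for daily data: short lags plus weekly,
# monthly, and annual seasonal lags.
DAILY_LAGS = [1, 2, 3, 4, 5, 6, 7, 14, 28, 364]

series = np.arange(400, dtype=float)  # toy daily series
token = lag_features(series, t=365, lags=DAILY_LAGS)
# token[0] is yesterday's value, token[-1] is the value from 364 days ago
```

For hourly data, the same helper would be called with a lag set emphasizing 24 and 168 (one day and one week of hours).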

This lag vector, along with the current value and time-derived covariates, is projected through a linear layer to create the token embedding. The decoder-only transformer then processes these tokens autoregressively, attending only to past positions. The architecture uses RMSNorm for layer normalization and RoPE (Rotary Position Embeddings) for position encoding, both borrowed directly from the LLaMA architecture.
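A minimal sketch of this input projection in PyTorch; the dimensions (`NUM_LAGS`, `NUM_TIME_FEATS`, `D_MODEL`) are illustrative placeholders, not the released model's actual sizes:

```python
import torch
import torch.nn as nn

NUM_LAGS, NUM_TIME_FEATS, D_MODEL = 10, 4, 64  # hypothetical dimensions

class LagTokenEmbedding(nn.Module):
    """Project [current value | lag vector | time covariates] into the
    transformer's embedding space with a single linear layer."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(1 + NUM_LAGS + NUM_TIME_FEATS, D_MODEL)

    def forward(self, value, lags, time_feats):
        # value: (batch, seq, 1), lags: (batch, seq, NUM_LAGS),
        # time_feats: (batch, seq, NUM_TIME_FEATS)
        return self.proj(torch.cat([value, lags, time_feats], dim=-1))

emb = LagTokenEmbedding()
tok = emb(torch.randn(2, 32, 1),
          torch.randn(2, 32, NUM_LAGS),
          torch.randn(2, 32, NUM_TIME_FEATS))
# tok has shape (2, 32, D_MODEL): one embedding per timestep
```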

Why Lags Instead of Patches?

The lag-based approach has a specific advantage: it provides the model with explicit access to seasonally relevant historical values at every step. A patch-based model must learn to extract seasonal information from the raw signal within and across patches. A lag-based model gets seasonality served on a platter.

This design choice also means Lag-Llama does not require the input series to be segmented into fixed-length patches, which simplifies handling of series with varying lengths. Each timestep is its own token, with the lags providing the necessary context about the series history.

The trade-off is sequence length. Since each timestep is a separate token, the effective sequence length equals the context window length. Patch-based models compress this by a factor equal to the patch size, allowing them to process much longer histories within the same attention window. For problems requiring very long context (thousands of timesteps), this is a meaningful limitation.

Model Size and Training

Lag-Llama is notably compact. The model has approximately 10 million parameters, making it one of the smallest TSFMs in the current landscape. For comparison, Chronos-Large has 710M parameters, Moirai-Large has 311M, and TimesFM has 200M. Despite this size difference, Lag-Llama achieves competitive results on standard benchmarks, which speaks to the efficiency of its lag-based representation.

Training was conducted on a diverse corpus assembled from the GluonTS dataset collection, spanning domains including retail, energy, traffic, and economics. The training procedure uses teacher forcing with a standard autoregressive next-step prediction objective. The relatively small model size means training is feasible on a single GPU, which has been a significant factor in community adoption.

Probabilistic Output via Distribution Heads

Unlike models that produce point forecasts and rely on post-hoc methods for uncertainty, Lag-Llama is inherently probabilistic. The output at each timestep is not a single predicted value but the parameters of a Student-t distribution: location (mu), scale (sigma), and degrees of freedom (nu).

The Student-t distribution was chosen deliberately over a Gaussian because its heavier tails better accommodate the occasional large deviations common in real-world time series. The degrees-of-freedom parameter allows the model to adaptively control tail heaviness: high nu approximates a Gaussian for well-behaved series, while low nu produces wider tails for volatile ones.
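One common way to parameterize such a distribution head (a sketch under assumed conventions, not Lag-Llama's exact implementation) is a linear layer whose raw outputs are constrained to valid ranges:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StudentTHead(nn.Module):
    """Map the transformer's hidden state to Student-t parameters
    (mu, sigma, nu), with positivity constraints on sigma and nu."""
    def __init__(self, d_model: int = 64):
        super().__init__()
        self.linear = nn.Linear(d_model, 3)

    def forward(self, h):
        raw = self.linear(h)
        mu = raw[..., 0]                          # location: unconstrained
        sigma = F.softplus(raw[..., 1]) + 1e-6    # scale: strictly positive
        nu = 2.0 + F.softplus(raw[..., 2])        # df > 2 keeps variance finite
        return mu, sigma, nu

head = StudentTHead()
mu, sigma, nu = head(torch.randn(2, 32, 64))
dist = torch.distributions.StudentT(df=nu, loc=mu, scale=sigma)
nll = -dist.log_prob(torch.randn(2, 32)).mean()  # training objective: NLL
```

Training then minimizes the negative log-likelihood of the observed next value under the predicted distribution, which is what makes the forecasts probabilistic by construction.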

Forecast samples are drawn from the predicted distribution at each step, and these samples are fed back autoregressively to generate the next step. Running multiple sample paths produces a set of forecast trajectories from which you can extract arbitrary quantiles.
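The sampling loop can be sketched generically; `predict_params` below is a hypothetical stand-in for the model's forward pass, which returns Student-t parameters given the history:

```python
import numpy as np

def sample_paths(predict_params, context, horizon, num_paths=100, rng=None):
    """Roll forward `horizon` steps: draw one Student-t sample per step
    per path and feed it back into the history autoregressively."""
    rng = np.random.default_rng(rng)
    paths = np.empty((num_paths, horizon))
    for p in range(num_paths):
        history = list(context)
        for h in range(horizon):
            mu, sigma, nu = predict_params(history)
            x = mu + sigma * rng.standard_t(nu)  # one sample from the t-dist
            paths[p, h] = x
            history.append(x)                    # feed the sample back
    return paths

# Toy stand-in model: random walk with t-distributed noise.
toy = lambda hist: (hist[-1], 1.0, 5.0)
paths = sample_paths(toy, context=[0.0], horizon=20, num_paths=200, rng=0)
q10, q50, q90 = np.quantile(paths[:, -1], [0.1, 0.5, 0.9])  # any quantile
```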

Fine-Tuning: Where Lag-Llama Shines

A key design goal of Lag-Llama is efficient fine-tuning. The small parameter count means fine-tuning on a target dataset is fast and memory-efficient. The authors demonstrate that fine-tuning Lag-Llama on as few as a hundred series from a target domain can significantly boost accuracy compared to zero-shot inference.

This positions Lag-Llama as a strong choice for the pretrain-then-fine-tune workflow. Use the pretrained weights as a starting point that captures general time series dynamics, then adapt to your specific domain with a modest amount of domain data. The fine-tuning process typically converges in a few hundred gradient steps, taking minutes rather than hours.
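The workflow is schematically the loop below, shown with a toy stand-in model rather than the actual Lag-Llama checkpoint (whose fine-tuning entry points live in the GitHub repo): teacher-forced windows, Student-t negative log-likelihood, a few hundred gradient steps.

```python
import torch
import torch.nn as nn

class TinyForecaster(nn.Module):
    """Toy stand-in for a pretrained checkpoint: any module mapping a
    context window to Student-t parameters fits this sketch."""
    def __init__(self, context_len=32):
        super().__init__()
        self.net = nn.Linear(context_len, 3)

    def forward(self, ctx):
        raw = self.net(ctx)
        mu = raw[..., 0]
        sigma = nn.functional.softplus(raw[..., 1]) + 1e-6
        nu = 2.0 + nn.functional.softplus(raw[..., 2])
        return torch.distributions.StudentT(df=nu, loc=mu, scale=sigma)

model = TinyForecaster()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

series = torch.sin(torch.linspace(0, 20, 500))  # toy target-domain data
for step in range(200):  # "a few hundred gradient steps"
    i = torch.randint(0, len(series) - 33, (16,))
    ctx = torch.stack([series[j:j + 32] for j in i])  # teacher-forced windows
    tgt = series[i + 32]                              # next-step targets
    loss = -model(ctx).log_prob(tgt).mean()           # Student-t NLL
    opt.zero_grad(); loss.backward(); opt.step()
```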

Benchmark Results and Comparisons

On the Monash Forecasting Archive and GluonTS benchmarks, Lag-Llama achieves results competitive with much larger models on many datasets, particularly for short-to-medium horizons. Its probabilistic forecasts, evaluated via CRPS (Continuous Ranked Probability Score), are well-calibrated relative to its model size.
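For sample-based forecasters like Lag-Llama, CRPS is typically estimated from the forecast samples via the identity CRPS(F, y) = E|X − y| − ½ E|X − X′|, where X and X′ are independent draws from the forecast distribution:

```python
import numpy as np

def crps_from_samples(samples: np.ndarray, y: float) -> float:
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|,
    computed against the empirical forecast samples."""
    samples = np.asarray(samples, dtype=float)
    term1 = np.abs(samples - y).mean()
    term2 = np.abs(samples[:, None] - samples[None, :]).mean()
    return term1 - 0.5 * term2

rng = np.random.default_rng(0)
forecast = rng.standard_t(df=5, size=1000)  # e.g. sampled forecast paths
score = crps_from_samples(forecast, y=0.0)  # lower is better; 0 is perfect
```

Lower CRPS is better, and the score is zero only when the forecast distribution collapses exactly onto the observed value, which is why it rewards both sharpness and calibration.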

Compared to Chronos, Lag-Llama avoids the information loss inherent in binning continuous values and produces continuous distributional outputs rather than categorical ones. Compared to TimesFM, which focuses on point forecast accuracy, Lag-Llama provides native probabilistic forecasts but may trail on pure median accuracy for some datasets. Compared to Moirai, Lag-Llama lacks multivariate capability (it is univariate only) but is far cheaper to run and fine-tune.

Limitations

The main limitations are clear. Lag-Llama is univariate only, so it cannot model dependencies between related series. The fixed lag index set means performance depends on whether the chosen lags align with the actual seasonality of the target series. And the per-timestep tokenization limits the effective context length compared to patch-based models.

Availability and Community

Lag-Llama is fully open source, with pretrained weights and training code available on GitHub and Hugging Face. The model integrates with GluonTS for data handling and PyTorch for training and inference. Its small size and straightforward architecture have made it a popular starting point for researchers experimenting with TSFM fine-tuning and adaptation techniques. On TSFM.ai, Lag-Llama is available for both zero-shot inference and as a fine-tuning base through our model API.
