tokenization · architecture · model-design · technical-deep-dive

Tokenization Strategies Compared: Quantization vs Patching vs Lag-Based

Time series foundation models must bridge the gap between continuous signals and discrete transformer inputs. We compare three dominant tokenization strategies — quantization, patching, and lag-based — and when each one works best.

TSFM.ai Team
July 8, 2024 · 4 min read

Transformers operate on sequences of discrete tokens. Language models tokenize text into subwords, vision transformers split images into patches, and audio models chop waveforms into spectral frames. But time series data is inherently continuous: a stream of floating-point values sampled at some frequency, with no natural vocabulary or spatial grid to lean on. The question of how to convert this continuous signal into a token sequence is not a preprocessing detail — it is a fundamental architectural decision that shapes what the model can learn, how efficiently it scales, and which forecasting tasks it handles well.

Across the current generation of time series foundation models, three tokenization strategies have emerged as dominant approaches. Each reflects a different philosophy about what information a token should carry.

Quantization and Binning

The most direct approach is to make time series look like language. Chronos, developed by Amazon and described in Ansari et al. (2024), scales each input series by its mean absolute value, then maps the normalized values into one of B discrete bins (typically B = 4096). Each bin corresponds to a token ID, producing an integer sequence that can be fed directly into a standard T5 encoder-decoder architecture. The model outputs a categorical distribution over the bin vocabulary at each decoding step, and multiple sampled trajectories yield probabilistic forecasts.

The advantage is architectural simplicity: any pretrained language model backbone can be repurposed with minimal modification. The scaling step ensures that series of different magnitudes share the same vocabulary, enabling zero-shot generalization across domains.

The trade-off is information loss. Binning compresses a continuous value into one of a finite number of buckets, and resolution is bounded by vocabulary size. Increasing B improves fidelity but inflates the softmax computation at every decoding step. Additionally, the scaling procedure assumes a single global scale per series, which can be problematic for series with regime changes or heteroscedastic variance. Despite these limitations, Chronos demonstrates that even coarse discretization preserves enough structure for competitive forecasting accuracy.
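
The mechanics can be sketched in a few lines. This is a minimal illustration of mean-scaling plus uniform binning, not the exact Chronos implementation: the bin-edge range (`clip`), the uniform spacing, and the absence of special tokens are simplifying assumptions here.

```python
import numpy as np

def quantize(series, num_bins=4096, clip=15.0):
    """Mean-scale a series, then map each value to a discrete bin ID.

    Sketch of Chronos-style quantization. The clipping range and
    uniform bin edges are illustrative assumptions.
    """
    scale = np.mean(np.abs(series))
    scaled = series / scale
    # num_bins - 1 edges define num_bins buckets over [-clip, clip].
    edges = np.linspace(-clip, clip, num_bins - 1)
    token_ids = np.digitize(scaled, edges)  # integers in [0, num_bins - 1]
    return token_ids, scale

def dequantize(token_ids, scale, num_bins=4096, clip=15.0):
    """Map bin IDs back to approximate real values via bin centers."""
    edges = np.linspace(-clip, clip, num_bins - 1)
    centers = np.concatenate(
        ([edges[0]], (edges[:-1] + edges[1:]) / 2, [edges[-1]])
    )
    return centers[token_ids] * scale
```

Round-tripping a series through `quantize` and `dequantize` makes the information loss concrete: the reconstruction error is bounded by half a bin width times the series scale, so larger `num_bins` means higher fidelity and a larger softmax.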

Patching

Patching groups contiguous timesteps into fixed-size windows, where each window becomes a single token. Introduced for time series by PatchTST (Nie et al., 2023), this strategy has become the most widely adopted tokenization approach. MOMENT uses ViT-style non-overlapping patches with a masked reconstruction objective (Goswami et al., 2024). TimesFM applies patched input tokenization within its decoder architecture. Moirai extends patching with its Any-Variate Attention mechanism to handle multivariate inputs.

The core benefit is computational. For a series of length L with patch size P, the transformer processes L/P tokens instead of L. Since self-attention cost scales quadratically with sequence length, patching reduces attention cost by roughly a factor of P². A patch size of 16 over a 512-step context means 32 tokens rather than 512, cutting attention FLOPs by roughly 256x. This enables models to ingest much longer histories within the same compute budget.
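
The reshape itself is trivial. Below is a minimal sketch of non-overlapping patching; real implementations (PatchTST and descendants) additionally apply a learned linear projection per patch and pad series whose length is not divisible by the patch size.

```python
import numpy as np

def patchify(series, patch_size=16):
    """Split a 1-D series into non-overlapping patches, one token each.

    Minimal sketch: truncates any remainder instead of padding, and
    omits the learned projection used in actual patch-based models.
    """
    usable = (len(series) // patch_size) * patch_size
    return series[:usable].reshape(-1, patch_size)

context = np.arange(512, dtype=float)
patches = patchify(context, patch_size=16)
print(patches.shape)  # (32, 16): 32 tokens instead of 512 timesteps
```

The (32, 16) output makes the compression explicit: attention now operates over 32 tokens, while each token still carries all 16 raw values for the projection layer to summarize.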

Patches also serve as natural feature extractors. Each patch token captures local temporal structure — short-term trends, level shifts, and intra-patch dynamics — through a linear projection layer. The transformer then learns inter-patch relationships via attention, separating local pattern recognition from global dependency modeling.

The limitations are boundary effects and fixed granularity. Events that fall across a patch boundary are split between two tokens, potentially making them harder for the model to detect. The patch size also imposes a fixed temporal resolution: a patch size of 16 on hourly data means each token represents 16 hours, which may be too coarse for some patterns and too fine for others. Some implementations mitigate this with overlapping patches or multi-scale patching, at the cost of additional complexity.

Lag-Based Tokenization

Lag-Llama, introduced by Rasul et al. (2023), takes a fundamentally different approach. Rather than grouping adjacent timesteps, it constructs each token from the value at time t plus a vector of lagged values: x(t-1), x(t-7), x(t-14), x(t-28), x(t-364), and so on. The lag indices are chosen to capture common seasonal periodicities. This lag vector is projected through a linear layer to form the token embedding, and a decoder-only transformer processes these tokens autoregressively.
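
Token construction can be sketched as follows. The lag set below mirrors the daily-frequency lags mentioned above, but the exact lag indices and any extra covariates (e.g., date features) vary by implementation; this is an illustrative assumption, not Lag-Llama's precise feature pipeline.

```python
import numpy as np

def lag_tokens(series, t_indices, lags=(1, 7, 14, 28, 364)):
    """Build one token per target timestep: the current value plus
    the values at each seasonal lag.

    Minimal sketch of lag-based tokenization; a linear layer would
    then project each row into the transformer's embedding space.
    """
    tokens = []
    for t in t_indices:
        if t < max(lags):
            raise ValueError("not enough history for the largest lag")
        # Current value followed by seasonally lagged values.
        tokens.append([series[t]] + [series[t - k] for k in lags])
    return np.array(tokens)
```

Note the constraint this encodes: a token at time t requires at least `max(lags)` steps of history, so the first year of a daily series (with a 364-day lag) cannot be tokenized at all.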

The strength of this approach is explicit access to seasonal context. A patch-based model must learn to extract weekly or annual patterns from raw signal across multiple patches. A lag-based model receives seasonally relevant historical values directly in every token, giving the transformer a strong inductive bias toward periodic structure. This makes Lag-Llama particularly effective on data with clear, known seasonalities — retail demand with weekly cycles, energy consumption with daily and annual patterns, or web traffic with day-of-week effects.

The trade-off is twofold. First, each timestep produces its own token, so the sequence length equals the context window length. This limits how far back the model can look compared to patched models that compress the input. Second, the choice of lag indices encodes domain assumptions. The standard lag set works well for common business and environmental frequencies, but unusual periodicities (e.g., 5-day trading weeks, lunar cycles) may be underrepresented.

Choosing a Strategy

No single tokenization approach dominates across all scenarios. Patching excels when long context is important and the primary challenge is capturing dependencies over hundreds or thousands of timesteps — the quadratic compression makes this tractable. Quantization offers the most direct path to leveraging pretrained language model weights and works well as a general-purpose strategy, particularly when architectural flexibility matters more than fine-grained value resolution. Lag-based tokenization is the strongest choice when the data has well-understood seasonal structure and the goal is to maximize accuracy on those patterns with a compact model.

In practice, the choice often depends on the operational context as much as the data. For a deeper discussion of how these architectural differences affect real-world evaluation, see our post on benchmarking challenges across TSFMs. To experiment with models using each strategy, visit our model catalog.
