
Tiny Models, Big Results: IBM's Granite TTM and the MLP-Mixer Architecture for Time Series

IBM's Granite TTM packs competitive forecasting accuracy into roughly 1 million parameters by replacing attention with MLP-Mixer layers, enabling sub-100ms inference on CPU and opening the door to edge and serverless deployment.

TSFM.ai Team
March 12, 2025 · 3 min read

Most time series foundation models gain accuracy by scaling up. Larger transformers, longer context windows, more pretraining data. IBM Research took the opposite approach with Granite TinyTimeMixer (TTM), a forecasting model that fits in roughly 1 million parameters and runs inference in under 100 milliseconds on a CPU. The key architectural decision: replacing self-attention entirely with MLP-Mixer blocks, a design that trades the generality of attention for the efficiency of structured mixing operations purpose-built for patched time series.

MLP-Mixer: Attention-Free by Design

The MLP-Mixer architecture, originally proposed by Google Brain for vision tasks, processes sequences through two types of fully-connected layers applied in alternation. Token-mixing MLPs operate across positions in the sequence, allowing information to flow between different time steps. Channel-mixing MLPs operate within each position, transforming feature representations independently at each step. There is no attention mechanism, no key-query-value computation, and no softmax normalization.
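The alternation of token mixing and channel mixing can be captured in a few lines of NumPy. This is a minimal sketch with random weights, not TTM's actual implementation; the shapes and hidden size are illustrative assumptions. The key detail is the transpose: token mixing applies an MLP across the patch axis, channel mixing across the feature axis, each wrapped in a residual connection.

```python
import numpy as np

def mlp(x, w1, b1, w2, b2):
    """Two-layer MLP with ReLU, applied along the last axis."""
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
n_patches, d_model, hidden = 16, 64, 128  # illustrative sizes

x = rng.normal(size=(n_patches, d_model))  # one patched series

# Token mixing: transpose so the MLP mixes information *across* patches.
tw1, tb1 = rng.normal(size=(n_patches, hidden)) * 0.02, np.zeros(hidden)
tw2, tb2 = rng.normal(size=(hidden, n_patches)) * 0.02, np.zeros(n_patches)
x = x + mlp(x.T, tw1, tb1, tw2, tb2).T  # residual connection

# Channel mixing: the MLP acts within each patch, across feature channels.
cw1, cb1 = rng.normal(size=(d_model, hidden)) * 0.02, np.zeros(hidden)
cw2, cb2 = rng.normal(size=(hidden, d_model)) * 0.02, np.zeros(d_model)
x = x + mlp(x, cw1, cb1, cw2, cb2)

print(x.shape)  # (16, 64): shape is preserved, so blocks stack freely
```

Because each block preserves the input shape, these pairs can be stacked to arbitrary depth, exactly as attention blocks are stacked in a transformer.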

This works because patched time series data has strong spatial structure. When a raw series is segmented into non-overlapping patches (similar to the approach used in PatchTST and other patch-based models), each patch encodes a local temporal pattern. The relationships between these patches tend to be relatively regular: adjacent patches share trend information, seasonal patches repeat at fixed intervals. Token-mixing MLPs can learn these structured interactions directly without the overhead of computing pairwise attention scores across all positions.
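Patching itself is a simple reshape. The sketch below assumes non-overlapping patches and drops any remainder that does not fill a whole patch; real patch-based models differ in how they handle padding and overlap.

```python
import numpy as np

def to_patches(series, patch_len):
    """Segment a 1-D series into non-overlapping patches of patch_len,
    dropping any trailing remainder that does not fill a whole patch."""
    n = (len(series) // patch_len) * patch_len
    return series[:n].reshape(-1, patch_len)

series = np.arange(100, dtype=float)   # toy series of 100 time steps
patches = to_patches(series, patch_len=16)
print(patches.shape)  # (6, 16): six patches, four trailing steps dropped
```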

The computational advantage is significant. Self-attention scales quadratically with sequence length, while MLP-Mixer operations scale linearly. For the short-to-medium patch sequences typical in forecasting (16 to 64 patches), this difference translates to meaningfully lower latency and memory consumption.
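The scaling difference is easy to see by counting the dominant operations. The toy counts below track only the pairwise attention term and the token-mixing MLP term (with a fixed hidden width, an illustrative assumption), ignoring projections and constants, so the growth factors rather than the absolute numbers are the point.

```python
def attention_pairwise_ops(n, d=64):
    """Score matrix and value aggregation: O(n^2 * d), quadratic in length."""
    return n * n * d

def mixer_token_ops(n, d=64, hidden=128):
    """Token-mixing MLP per channel (n -> hidden -> n): O(n * hidden * d),
    linear in length since the hidden width is fixed."""
    return n * hidden * d

for n in (64, 128, 256):
    print(n,
          attention_pairwise_ops(n) / attention_pairwise_ops(64),
          mixer_token_ops(n) / mixer_token_ops(64))
# Doubling the patch count quadruples the attention term (4x, 16x)
# but only doubles the mixer term (2x, 4x).
```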

TSMixer: Adapting for Time Series

Granite TTM builds on an adapted variant called TSMixer, which introduces resolution-aware mixing to handle the multi-scale nature of temporal data. Standard MLP-Mixer treats all token positions identically, but time series patches at different temporal resolutions (hourly vs. daily vs. weekly) carry different types of information. TSMixer adds resolution-specific mixing pathways that allow the model to process fine-grained and coarse-grained patterns through separate channels before combining them.

The architecture also incorporates cross-channel mixing for multivariate inputs, enabling the model to capture correlations between different variables (e.g., temperature and energy demand) without the parameter cost of full multi-head attention across both the time and variable dimensions. This keeps the total parameter count anchored near 1 million even when handling multi-feature inputs.
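Cross-channel mixing follows the same recipe, just along the variable axis. The sketch below is an assumption-laden illustration (three variables, a tiny hidden width), not TSMixer's exact formulation: a small MLP over the variable dimension at each patch-feature position, costing O(n_vars * hidden) weights rather than full attention over both axes.

```python
import numpy as np

rng = np.random.default_rng(1)
n_vars, n_patches, d_model, hidden = 3, 16, 64, 8  # illustrative sizes

# e.g. temperature, energy demand, price as the three variables
x = rng.normal(size=(n_vars, n_patches, d_model))

# Small MLP over the variable axis: n_vars -> hidden -> n_vars.
w1 = rng.normal(size=(n_vars, hidden)) * 0.1
w2 = rng.normal(size=(hidden, n_vars)) * 0.1

# Move the variable axis last, apply the MLP, move it back; add a residual.
xt = np.moveaxis(x, 0, -1)                 # (n_patches, d_model, n_vars)
mixed = np.maximum(0.0, xt @ w1) @ w2      # mix across variables only
x = x + np.moveaxis(mixed, -1, 0)

print(x.shape)  # (3, 16, 64): variable correlations learned with few weights
```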

Inference Speed and Deployment Flexibility

Granite TTM averages approximately 95 milliseconds per forecast on CPU hardware. This is fast enough for real-time dashboards, serverless function invocations, and deployment on edge devices including hardware as constrained as a Raspberry Pi. Compare this to Chronos-Bolt-Small at 48 million parameters, which requires GPU acceleration for high-throughput workloads. For teams running GPU-optimized inference pipelines, Granite TTM barely registers on utilization metrics, freeing accelerator capacity for larger models.

Training efficiency follows the same pattern. The model trains in hours rather than days on a single GPU, and fine-tuning on domain-specific data requires minimal compute. This makes it practical to maintain per-customer or per-domain fine-tuned variants without spiraling infrastructure costs.

Benchmark Results: Punching Above Its Weight

IBM evaluated Granite TTM against established benchmarks including the Monash repository and standard multivariate datasets (ETT, Weather, Electricity). The results are striking: TTM matches or comes close to PatchTST, a 40-million-parameter transformer, on many forecasting tasks despite having roughly 40 times fewer parameters. On short-horizon univariate benchmarks, TTM frequently wins outright. On longer horizons and complex multivariate settings, the accuracy gap narrows but transformer-based models retain an edge.

The model is available on Hugging Face and documented in IBM's Granite time series research blog.

When to Choose Granite TTM

Granite TTM fits a specific and increasingly common set of requirements. If you run high-throughput batch inference across thousands of series, operate in cost-sensitive environments where GPU hours are budgeted tightly, deploy to edge devices for predictive maintenance, or serve real-time dashboards that need sub-100ms response times, TTM is a strong candidate. It excels in production forecast pipelines where latency and cost matter as much as accuracy.

Where TTM is less suited: tasks requiring long context windows (its context length is limited compared to transformer-based models), complex multivariate dependency modeling where cross-variable attention provides a meaningful accuracy boost, or scenarios where you need a single model to handle forecasting alongside classification and anomaly detection.

The Efficiency Frontier

Granite TTM represents a broader trend in the TSFM space toward finding the efficiency frontier: the smallest model that delivers acceptable accuracy for a given task. Not every forecasting problem needs hundreds of millions of parameters. For many production workloads, a 1-million-parameter model running on CPU is not a compromise but the right architectural choice.

Explore Granite TTM alongside other models in the TSFM.ai catalog, or test it directly in the playground.
