
Scaling TSFM Inference: GPU Optimization

Serving TSFMs at scale requires careful GPU optimization. Here's how we achieve sub-100ms latency for batch forecasting.

TSFM.ai Team
December 1, 2025 · 5 min read


Serving time series foundation models in production presents a distinct set of engineering challenges compared to serving large language models. The workload profile is fundamentally different: instead of a few long sequences with heavy autoregressive decoding, TSFM inference typically involves many short sequences processed in large batches. A single API call might contain 10,000 individual time series, each with 512 observations, requesting a 96-step forecast. Optimizing for this workload pattern requires rethinking batching, memory management, and GPU utilization strategies.

The TSFM Inference Profile

A typical LLM serving request involves a single prompt of hundreds to thousands of tokens, followed by sequential token generation. TSFM requests look nothing like this. The input is a batch of numerical sequences (not discrete tokens), the context lengths are shorter (256 to 2048 values), and for encoder-based models like MOMENT or PatchTST, the entire output is produced in a single forward pass rather than autoregressively.

Even autoregressive TSFMs like Chronos generate far fewer output tokens than an LLM. A 96-step forecast with a vocabulary of 4096 tokens requires 96 decoding steps, compared to hundreds or thousands for text generation. This means the prefill (encoding) phase dominates total latency, not the decode phase.

These characteristics create opportunities for optimization that differ from standard LLM serving.

Batching Strategies

Static batching pads all sequences in a batch to the maximum length, wasting compute on padding tokens. For TSFM workloads where series lengths vary (one client sends 128 observations, another sends 1024), static batching can waste 50% or more of GPU FLOPs on padding.
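To make that waste concrete, here is a minimal sketch (not our serving code) that computes the fraction of token compute spent on padding when every series in a batch is padded to the longest one:

```python
def padding_waste(lengths):
    """Fraction of token compute spent on padding under static batching,
    which pads every series in the batch to the batch maximum."""
    padded_tokens = max(lengths) * len(lengths)
    real_tokens = sum(lengths)
    return 1 - real_tokens / padded_tokens
```

For the example in the text, a batch containing a 128-observation series and a 1024-observation series wastes `padding_waste([128, 1024])` ≈ 44% of its compute on padding, and the waste grows as the length spread widens.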

Dynamic batching groups incoming requests by similar sequence length into bins, processing each bin as a separate batch. We use length-based bucketing with bins at 128, 256, 512, 1024, and 2048 tokens. Sequences are padded only to their bin boundary, reducing wasted compute to under 15% in practice. Requests accumulate in a queue with a configurable maximum wait time (default 10ms), balancing latency against batching efficiency.
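The bucketing logic itself is simple. The sketch below (with our bin boundaries, but otherwise a hypothetical simplification of the real scheduler) assigns each request to the smallest bin that fits it:

```python
# Bin boundaries from the post; the helper names are illustrative.
BIN_BOUNDARIES = [128, 256, 512, 1024, 2048]

def assign_bin(seq_len: int) -> int:
    """Return the smallest bin boundary that fits the sequence."""
    for boundary in BIN_BOUNDARIES:
        if seq_len <= boundary:
            return boundary
    raise ValueError(f"sequence length {seq_len} exceeds max bin")

def bucket_requests(requests):
    """Group (request_id, series) pairs by their padded bin boundary."""
    buckets = {}
    for req_id, series in requests:
        buckets.setdefault(assign_bin(len(series)), []).append((req_id, series))
    return buckets
```

Each bucket is then padded only to its boundary and dispatched as one batch once the queue's max-wait timer (10ms by default) fires or the batch fills.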

Continuous batching is less relevant for encoder-based TSFMs (which process all tokens in parallel) but matters for autoregressive models like Chronos. (For more on how Chronos v2 improves on the original architecture, see Chronos v2: What's New.) We implement iteration-level scheduling: as shorter sequences in a batch complete their decoding, new sequences are inserted into the freed slots without waiting for the entire batch to finish. This keeps GPU utilization high even when forecast horizons vary across requests.
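The scheduling loop can be sketched as follows. This is a simplified model of iteration-level scheduling, not our production scheduler: `step_fn` stands in for one batched decode step, and `horizon_of` maps each sequence to its forecast horizon:

```python
from collections import deque

def continuous_decode(pending, batch_size, horizon_of, step_fn):
    """Iteration-level scheduling sketch: when a sequence finishes its
    forecast horizon, a queued sequence takes its slot on the very next
    decode step instead of waiting for the whole batch to drain."""
    pending = deque(pending)
    slots = {}          # slot index -> (seq_id, steps_completed)
    completed = []
    while pending or slots:
        # Fill any free slots from the waiting queue.
        for slot in range(batch_size):
            if slot not in slots and pending:
                slots[slot] = (pending.popleft(), 0)
        step_fn([seq_id for seq_id, _ in slots.values()])  # one decode step
        for slot in list(slots):
            seq_id, done = slots[slot]
            done += 1
            if done >= horizon_of[seq_id]:
                completed.append(seq_id)
                del slots[slot]     # slot is free for the next iteration
            else:
                slots[slot] = (seq_id, done)
    return completed
```

With horizons of 2, 1, and 3 steps and two slots, this finishes in 4 decode steps, versus 5 for static batching (a batch of the first two padded to 2 steps, then the third alone for 3).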

Quantization and Precision

Model quantization reduces memory footprint and increases throughput with minimal accuracy impact.

FP16 inference is the baseline for all TSFM serving. The conversion from FP32 training weights to FP16 is lossless for practical purposes, halves memory usage, and roughly doubles throughput on GPUs with tensor cores.

INT8 quantization (via post-training quantization with calibration) further reduces memory by 2x and increases throughput by an additional 30-40% on hardware with INT8 tensor cores. We evaluated INT8 Chronos-Large on our standard benchmark suite and observed less than 1% MASE degradation averaged across datasets. The occasional series where INT8 hurts accuracy (typically those with very small absolute values) can be detected and routed to FP16 inference.
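The precision routing described above reduces to a cheap per-series check. The sketch below is illustrative, and the threshold value is an assumption rather than our tuned production setting:

```python
def route_precision(series, scale_floor=1e-3):
    """Hypothetical router: series whose absolute values are all tiny
    quantize poorly to INT8 (the quantization scale collapses), so send
    them to the FP16 path instead. scale_floor is illustrative."""
    max_abs = max(abs(x) for x in series)
    return "fp16" if max_abs < scale_floor else "int8"
```

In practice a check like this runs during request preprocessing, so the FP16 fallback costs nothing for the vast majority of series that stay on the INT8 path.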

INT4 quantization pushes further but introduces measurable accuracy loss (2-4% average MASE increase). We reserve INT4 for latency-critical applications where the accuracy tradeoff is acceptable, or for the largest models where INT8 does not fit in GPU memory.

GPU Selection and Parallelism

Different GPU types suit different model sizes and throughput requirements.

NVIDIA A10G (24GB VRAM, available on AWS g5 instances) handles models up to roughly 400M parameters in FP16. Chronos-Base (200M parameters) achieves approximately 15,000 series/second on a single A10G with dynamic batching and INT8 quantization. This is our default serving GPU for cost-sensitive deployments.

NVIDIA A100 (80GB VRAM) enables serving larger models and larger batch sizes. Chronos-Large (700M parameters) runs comfortably in FP16 with batch sizes up to 512, achieving roughly 8,000 series/second. The higher memory bandwidth (2 TB/s vs 600 GB/s on A10G) particularly benefits the memory-bound attention operations in encoder-based models.

NVIDIA H100 (80GB VRAM, FP8 tensor cores) provides the best absolute throughput. FP8 inference on H100 achieves roughly 2x the throughput of FP16 on A100 for equivalent models, with accuracy comparable to INT8. For peak-traffic serving, H100s provide the lowest cost-per-forecast at high utilization.

Frameworks like NVIDIA TensorRT add kernel fusion and graph optimization that can squeeze additional throughput from these GPUs. For models that exceed single-GPU memory (rare for current TSFMs, but relevant for experimental billion-parameter variants), we use tensor parallelism to shard model weights across 2-4 GPUs within a node. For smaller models where a single GPU has spare capacity, we run multiple model replicas per GPU using CUDA MPS (Multi-Process Service) to maximize utilization.

Serving Stack

Our serving infrastructure uses a vLLM-inspired architecture adapted for time series workloads. The key components:

CUDA graph compilation eliminates kernel launch overhead for repeated inference patterns. Because TSFM inputs have predictable shapes (fixed patch sizes, binned sequence lengths), we pre-compile CUDA graphs for each (model, batch_size, sequence_length) tuple at startup. This reduces per-inference CPU overhead from 2-3ms to under 0.1ms.
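Because the shape space is finite, the graphs can all be captured up front and looked up by shape key at request time. The sketch below shows the cache structure only; `compile_fn` stands in for actual graph capture (in PyTorch that would use `torch.cuda.CUDAGraph`), and the class name is hypothetical:

```python
class GraphCache:
    """Per-shape graph cache sketch. Keys are (model, batch_size,
    seq_len) tuples; values are whatever compile_fn produces (in real
    serving, a captured CUDA graph plus its static input buffers)."""

    def __init__(self, compile_fn):
        self.compile_fn = compile_fn
        self._graphs = {}

    def warmup(self, model, batch_sizes, seq_lens):
        """Pre-compile every shape combination at server startup."""
        for bs in batch_sizes:
            for sl in seq_lens:
                key = (model, bs, sl)
                self._graphs[key] = self.compile_fn(*key)

    def get(self, model, batch_size, seq_len):
        """O(1) lookup at inference time; KeyError means a shape we
        forgot to warm up, which should fail loudly."""
        return self._graphs[(model, batch_size, seq_len)]
```

The important property is that inference-time dispatch is a dictionary lookup rather than a compilation, which is what takes per-inference CPU overhead from milliseconds to microseconds.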

torch.compile with max-autotune optimizes the model's forward pass by fusing operations and selecting optimal CUDA kernels. For encoder-based models, this provides a 20-30% throughput improvement over eager execution. The compilation cost (30-60 seconds per model) is amortized at server startup.

Precomputed embeddings cache the patch embedding and positional encoding computations. For models like MOMENT where the patch embedding is a simple linear projection, precomputing these for common input shapes avoids redundant computation across requests.
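Positional encodings are the clearest case, since they depend only on shape, not on the input values. Here is a self-contained sketch using a memoized sinusoidal encoding (illustrative; the actual models' encodings may differ):

```python
import math
from functools import lru_cache

@lru_cache(maxsize=None)
def positional_encoding(num_patches: int, dim: int):
    """Sinusoidal positional encoding, memoized per shape. Because the
    result depends only on (num_patches, dim), one computation is shared
    across every request with that shape."""
    return [
        [math.sin(pos / 10000 ** (2 * (i // 2) / dim)) if i % 2 == 0
         else math.cos(pos / 10000 ** (2 * (i // 2) / dim))
         for i in range(dim)]
        for pos in range(num_patches)
    ]
```

Repeated calls with the same shape return the cached object, so the encoding is computed once per shape for the lifetime of the server process.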

Latency Targets

Our production SLOs target P50 latency under 50ms and P99 under 200ms for single-series forecasts, and P50 under 100ms for batch requests up to 1,000 series. We achieve these targets through the combination of dynamic batching, CUDA graph compilation, and right-sized GPU selection.
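An SLO check over recorded latencies is a percentile computation. A minimal sketch using nearest-rank percentiles (the aggregation method is an assumption; production monitoring systems typically use histogram-based estimates):

```python
import math

def percentile(samples_ms, p):
    """Nearest-rank percentile over recorded latency samples."""
    ordered = sorted(samples_ms)
    idx = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[idx]

def meets_single_series_slo(samples_ms):
    """P50 under 50ms and P99 under 200ms, per the targets above."""
    return percentile(samples_ms, 50) < 50 and percentile(samples_ms, 99) < 200
```

Note that the P99 target is what actually constrains the system: a handful of slow requests (cold bins, oversized batches) can blow the tail while leaving the median untouched.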

Autoscaling

Traffic to the forecast API is bursty. Retail customers trigger large batch jobs during nightly planning runs; energy customers spike during morning ramp-up periods. We use GPU utilization-based autoscaling with a target of 70% average utilization. Scale-up is aggressive (new replicas launch within 90 seconds using pre-warmed container images), while scale-down is conservative (15-minute cooldown) to avoid thrashing during intermittent traffic patterns. Pre-warmed containers with models already loaded in GPU memory eliminate the cold-start penalty that would otherwise add 30-60 seconds of latency for the first request to a new replica. For guidance on integrating these optimizations into end-to-end systems, see building production forecast pipelines. To learn more about TSFM.ai's platform and the models we serve, read our introduction post.
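The scale decision itself is straightforward once the policy is fixed. The sketch below uses the 70% target and 15-minute cooldown from above; the exact formula is an illustrative simplification of the real autoscaler:

```python
import math

def scale_decision(avg_util, replicas, now_s, last_scale_down_s,
                   target=0.70, cooldown_s=900):
    """Utilization-target autoscaling sketch: scale up immediately when
    utilization exceeds the target, but only scale down after the
    cooldown window has passed since the last scale-down."""
    desired = max(1, math.ceil(replicas * avg_util / target))
    if desired > replicas:
        return desired          # aggressive scale-up, no cooldown
    if desired < replicas and now_s - last_scale_down_s >= cooldown_s:
        return desired          # conservative scale-down
    return replicas             # hold steady during cooldown
```

The asymmetry is deliberate: adding a replica during a burst costs a little idle capacity later, while removing one too early costs a cold start and a latency spike when the burst returns.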
