Real-Time Streaming Inference with TSFMs: Moving from Batch to Continuous Forecasting
Most TSFM deployments run forecasts in hourly or daily batches, but a growing class of use cases demands continuous, low-latency predictions. Here's how to architect streaming inference pipelines for time series foundation models.
The default deployment pattern for time series foundation models is batch inference. Collect a window of history, run a forecast, store the results, repeat on a schedule. This approach works well for demand planning, weekly capacity reviews, and any scenario where decisions happen on human timescales. Most of the production forecast pipelines running today follow this pattern, and for good reason: batching maximizes GPU utilization, simplifies error handling, and aligns naturally with how data warehouses operate.
But batch inference introduces latency by design. If your pipeline runs hourly and a critical shift happens five minutes after the last run, you wait 55 minutes for the next forecast. For a growing set of use cases, that delay is unacceptable.
Use Cases That Demand Continuous Prediction
Several domains are pushing TSFM inference toward true streaming. Algorithmic trading systems need sub-second forecast updates as new price ticks arrive. Cloud auto-scaling triggers must anticipate traffic spikes minutes before they hit, not after the hourly batch catches up. Fraud detection pipelines score transactions in real time, and any delay in anomaly detection directly translates to financial loss. IoT sensor networks in manufacturing generate thousands of readings per second, each potentially signaling equipment degradation that demands immediate attention.
Infrastructure monitoring is another natural fit. Tools like Toto already target observability workloads where metric streams are continuous by nature. Telecom operators performing real-time capacity planning cannot afford to wait for a batch cycle when a cell tower approaches saturation. In each of these scenarios, the value of a forecast decays rapidly with age.
Architecture Patterns for Streaming Inference
Three patterns have emerged for running TSFMs against continuous data streams, each trading off latency, throughput, and operational complexity.
Sliding window on a stream processor. The most common approach connects a TSFM to a streaming stack: Apache Kafka as the transport and a processor such as Apache Flink for windowing. As new observations arrive on a topic, a windowing operator maintains the rolling context buffer for each series. When the window advances, it triggers an inference call. Flink's event-time semantics handle out-of-order and late-arriving data gracefully, which matters in distributed sensor networks where clock drift is inevitable. The tradeoff is infrastructure overhead: you are operating a distributed stream processor alongside your model serving layer.
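The windowing side can be sketched in plain Python. This is a minimal illustration, not Flink code: the Kafka consumer and operator wiring are omitted, `forecast` is a stub standing in for the TSFM call, and the 512-point context length is chosen for illustration.

```python
from collections import defaultdict, deque

CONTEXT_LEN = 512  # rolling context length; illustrative, model-dependent

def forecast(context):
    # Stub standing in for a TSFM inference call; returns a naive
    # persistence forecast (last observed value).
    return context[-1]

# One bounded rolling buffer per series; deque evicts the oldest point.
buffers = defaultdict(lambda: deque(maxlen=CONTEXT_LEN))

def on_observation(series_id, value):
    """Called for each record consumed from the stream (e.g. a Kafka topic)."""
    buf = buffers[series_id]
    buf.append(value)
    if len(buf) == CONTEXT_LEN:  # enough history: window advances, trigger inference
        return forecast(list(buf))
    return None  # still warming up; no forecast yet
```

In a real pipeline the buffer and its eviction policy live inside the stream processor's keyed state, so the framework handles partitioning and recovery.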
Model-as-microservice with request batching. Deploy the TSFM behind a serving framework like KServe and let upstream services push individual inference requests. The serving layer accumulates requests over a short window (5-50ms) and dispatches them as a single batched GPU call. This pattern reuses the GPU optimization techniques proven in batch pipelines while exposing a request-response interface that streaming clients can call on every new observation. Latency depends on the batching window and model speed, typically landing in the 50-200ms range for lightweight models.
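The accumulation logic can be sketched without a serving framework. The class below is illustrative, not KServe's API: `MicroBatcher`, `model_fn`, and the parameter defaults are assumptions for demonstration.

```python
import time

class MicroBatcher:
    """Accumulates single-series requests and flushes them as one batched
    call, mimicking the short (5-50ms) batching window of a serving layer."""

    def __init__(self, model_fn, max_batch=32, max_wait_s=0.02):
        self.model_fn = model_fn      # batched inference call (one GPU dispatch)
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.pending = []
        self.first_arrival = None

    def submit(self, context):
        now = time.monotonic()
        if not self.pending:
            self.first_arrival = now  # start of this batching window
        self.pending.append(context)
        if len(self.pending) >= self.max_batch or \
                (now - self.first_arrival) >= self.max_wait_s:
            return self.flush()
        return None  # request queued; caller awaits the next flush

    def flush(self):
        batch, self.pending = self.pending, []
        # One batched call instead of len(batch) single-item calls.
        return self.model_fn(batch)
```

A production batcher would run the flush on a timer and return per-request futures rather than blocking callers, but the throughput win comes from the same accumulation step.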
Edge inference for sub-10ms latency. When network round-trips are too slow, push the model to the edge. Granite TTM runs in approximately 95ms on CPU and can be compiled to ONNX for further speedup, making it viable on gateway devices and embedded hardware. Edge deployment eliminates network latency entirely and keeps raw sensor data local, which matters in regulated environments. The cost is model capability: only the smallest TSFMs fit this profile.
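Before committing to an edge device, measure the per-call latency distribution on the target hardware rather than trusting published numbers. A stdlib-only sketch, with `model_fn` standing in for the real call (for example, an ONNX Runtime session invocation):

```python
import time
import statistics

def measure_latency(model_fn, context, warmup=10, runs=100):
    """Time repeated inference calls and report p50/p99 in milliseconds.
    model_fn is a stand-in for the deployed model's inference call."""
    for _ in range(warmup):          # warm caches / JIT before measuring
        model_fn(context)
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        model_fn(context)
        samples.append((time.perf_counter() - t0) * 1000.0)  # ms
    ordered = sorted(samples)
    return {
        "p50_ms": statistics.median(samples),
        "p99_ms": ordered[int(0.99 * (len(ordered) - 1))],
    }
```

Tail latency (p99) is the number that matters for a sub-10ms budget; a p50 comfortably under budget can still hide occasional spikes that blow the deadline.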
Which Models Are Fast Enough
Not every TSFM is suitable for streaming workloads. Latency per inference call determines the ceiling on update frequency. Among current models, Granite TTM at roughly 95ms on CPU and Chronos-Bolt-Small at approximately 88ms on GPU are fast enough for near-real-time pipelines with update intervals of a few hundred milliseconds. Generative models like Sundial land around 140ms, which is workable for second-scale updates. Heavier architectures such as Time-LLM, at roughly 650ms per call, are effectively limited to batch or infrequent micro-batch patterns.
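Per-call latency translates directly into an update-frequency ceiling. A back-of-envelope helper, using the approximate figures above and deliberately ignoring batching and queueing effects:

```python
def max_update_hz(latency_ms, concurrency=1):
    """Upper bound on per-series update frequency given per-call latency.
    A back-of-envelope ceiling: real throughput depends on batching,
    queueing, and hardware contention."""
    return concurrency * 1000.0 / latency_ms

# Approximate per-call latencies discussed in the text:
for name, ms in [("Granite TTM (CPU)", 95), ("Chronos-Bolt-Small (GPU)", 88),
                 ("Sundial", 140), ("Time-LLM", 650)]:
    print(f"{name}: up to ~{max_update_hz(ms):.1f} updates/sec per worker")
```

At ~650ms per call, Time-LLM tops out below two updates per second per worker, which is why it stays in batch territory.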
You can compare these models directly in the TSFM.ai model catalog or test latency characteristics in the playground.
KV Cache Reuse for Incremental Updates
For decoder-style TSFMs that generate forecasts autoregressively, a significant optimization is reusing the key-value cache from the context window. When a new observation arrives and the window slides forward by one step, most of the context remains unchanged. Rather than reprocessing the full history, the model can evict the oldest cached position, append the new observation's key-value state, and run only the incremental computation. This reduces per-update latency by 40-60% compared to full recomputation, making autoregressive models more competitive in streaming settings. The technique mirrors KV cache strategies in LLM serving but applied to shorter, numerically dense sequences.
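The bookkeeping can be illustrated with a toy model that counts positions processed per update rather than running real attention. `SlidingKVCache` and its method names are illustrative, not any model's actual API:

```python
from collections import deque

def full_recompute(window):
    # Baseline: re-encode every position in the context on each update.
    return len(window)  # positions processed

class SlidingKVCache:
    """Sketch of KV cache reuse under a sliding window: evict the oldest
    cached position, append the new observation's state, and process only
    the single new position."""

    def __init__(self, context_len):
        # deque with maxlen drops the oldest entry automatically on append,
        # which is exactly the eviction the sliding window needs.
        self.kv = deque(maxlen=context_len)

    def update(self, new_kv_state):
        self.kv.append(new_kv_state)  # evict oldest, append newest
        return 1                      # only the new position is processed
```

In the toy accounting, each update processes 1 position instead of the full context length; in practice the saving is smaller because attention over the cached states still costs compute, which is why the text cites a 40-60% reduction rather than near-elimination.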
Challenges in Continuous Pipelines
Streaming inference introduces operational challenges that batch pipelines avoid. State management across sliding windows requires careful checkpointing: if a node fails, you need to reconstruct the context buffer for every active series without reprocessing the full history. Late-arriving data can invalidate forecasts that were already emitted, requiring either watermarking strategies or explicit forecast revision semantics downstream. Model versioning becomes harder when you cannot simply swap the model between batch runs. Rolling updates in a streaming pipeline must handle the transition period where two model versions coexist, potentially producing inconsistent forecasts for the same series.
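The context-buffer checkpointing piece can be sketched with JSON snapshots of per-series buffers. This is a minimal illustration: a production pipeline would also record stream offsets alongside the snapshot and would typically lean on the stream processor's own checkpointing rather than rolling its own.

```python
import json
from collections import defaultdict, deque

CONTEXT_LEN = 256  # illustrative context length

def make_buffers():
    return defaultdict(lambda: deque(maxlen=CONTEXT_LEN))

def checkpoint(buffers):
    """Snapshot every active series' context buffer so a restarted worker
    can resume forecasting without replaying the full stream history."""
    return json.dumps({sid: list(buf) for sid, buf in buffers.items()})

def restore(snapshot):
    """Rebuild the per-series buffers from a snapshot after a failure."""
    restored = make_buffers()
    for sid, values in json.loads(snapshot).items():
        restored[sid].extend(values)
    return restored
```

The key property is that recovery cost is bounded by the context length per series, not by the length of the stream.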
Monitoring also shifts from batch-oriented metrics to streaming-specific concerns: forecast staleness (how old is the context behind the current prediction), throughput per partition, and consumer lag. The model routing layer must weigh latency budgets alongside accuracy when selecting models in real time.
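Staleness and per-partition throughput can be tracked with a small metrics helper. The class and its names below are illustrative, not a real monitoring API:

```python
import time

class StreamForecastMetrics:
    """Minimal sketch of streaming-specific forecast metrics: per-series
    staleness and per-partition forecast counts (feed these to a real
    metrics backend as gauges/counters)."""

    def __init__(self):
        self.last_obs_ts = {}  # series_id -> event time of newest context point
        self.emitted = {}      # partition -> forecasts emitted

    def record_observation(self, series_id, event_ts):
        self.last_obs_ts[series_id] = event_ts

    def record_forecast(self, partition):
        self.emitted[partition] = self.emitted.get(partition, 0) + 1

    def staleness_s(self, series_id, now=None):
        """Age of the newest observation behind the current forecast.
        Alert when this exceeds the pipeline's latency budget."""
        now = time.time() if now is None else now
        return now - self.last_obs_ts[series_id]
```

Staleness is the metric that most directly captures the "value of a forecast decays with age" concern: a pipeline can have healthy throughput while quietly serving forecasts built on minutes-old context.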
Practical Guidance: Start Near-Real-Time
For most teams, the pragmatic path is to start with near-real-time micro-batching (1-5 minute intervals) before committing to true per-event streaming. Micro-batching captures the majority of the latency improvement over hourly or daily runs while preserving the simpler operational model of batch inference. It also lets you validate that your downstream consumers can actually act on more frequent forecasts before investing in full streaming infrastructure.
As the TSFM.ai platform evolves, we are building toward native streaming support: WebSocket endpoints for persistent forecast connections, server-sent events for continuous forecast updates pushed to dashboards, and tighter integration with stream processing frameworks. The goal is to make the transition from batch to streaming a configuration change rather than an architecture redesign.