
Smart Model Routing: Choosing the Best TSFM

Not every time series needs the same model. TSFM.ai's routing engine automatically selects the best foundation model for each request.

TSFM.ai Team
October 12, 2025 · 5 min read


The time series foundation model landscape now includes over a dozen viable options: Chronos, Moirai, TimesFM, MOMENT, Lag-Llama, Timer, and more. Each model was pretrained on different data, uses a different architecture, and excels in different regimes. Chronos performs well on short univariate series with clear seasonality. Moirai handles multivariate inputs with covariates. TimesFM shines on long-horizon forecasts with extended context windows. Treating any single model as universally best leaves accuracy on the table for a significant fraction of real-world inputs.

TSFM.ai's routing engine solves this problem by automatically selecting the optimal model for each incoming request based on the characteristics of the input data. You can explore all supported models in our model catalog or test them in the playground.

The Model Selection Problem

Consider a platform serving diverse customers. One sends hourly electricity consumption data with 720 observations and requests a 168-step (one-week) forecast. Another sends daily retail sales for a new product with only 30 observations. A third sends a 50-variable sensor dataset from a manufacturing line. The ideal model differs for each case.

Manual model selection pushes this complexity onto users, requiring them to understand the strengths and limitations of each TSFM. Defaulting to a single model sacrifices accuracy for simplicity. A routing layer that automatically matches inputs to models gives users the best of both worlds: a single API endpoint with multi-model accuracy.

How the Router Works

The routing engine is a lightweight classifier that sits between the API gateway and the model serving layer. When a forecast request arrives, the router extracts a feature vector from the input time series and predicts which model will produce the most accurate forecast. The selected model then handles inference.

The full routing path adds less than 5 milliseconds of latency, a negligible overhead compared to the model inference itself.

Routing Features

The feature extraction stage computes a compact set of statistical descriptors that characterize the input series without running any model.

Spectral entropy measures signal complexity. A pure sine wave has low spectral entropy; white noise has maximum entropy. Series with low spectral entropy (strong periodic components) tend to favor models like Chronos that handle seasonality well. High-entropy series benefit from models with larger context windows that can capture irregular patterns.
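A minimal sketch of this feature, assuming the router normalizes entropy to [0, 1] (the exact production implementation is not public):

```python
import numpy as np

def spectral_entropy(x: np.ndarray) -> float:
    """Normalized spectral entropy in [0, 1]: low for strongly periodic
    signals, near 1 for noise-like signals."""
    # Power spectrum via the real FFT, dropping the DC component.
    psd = np.abs(np.fft.rfft(x - x.mean()))[1:] ** 2
    psd = psd / psd.sum()
    # Shannon entropy of the normalized spectrum, scaled by its maximum.
    entropy = -np.sum(psd * np.log(psd + 1e-12))
    return float(entropy / np.log(len(psd)))

t = np.arange(512)
sine = np.sin(2 * np.pi * t / 24)                         # strongly periodic
noise = np.random.default_rng(0).standard_normal(512)     # white noise
print(spectral_entropy(sine) < spectral_entropy(noise))   # True
```

A pure sine concentrates its power in a few frequency bins, so its entropy is far below that of white noise, whose spectrum is flat.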

Coefficient of variation (standard deviation divided by mean) captures relative variability. Highly variable series with CV above 1.0 are often better served by probabilistic models that naturally produce wide prediction intervals, like Moirai or Chronos, rather than point-forecast models.

Dominant frequency is extracted via FFT peak detection and identifies the primary seasonal period. This feature helps route to models whose patch sizes or context windows align well with the detected seasonality. A series with a dominant period of 168 (weekly at hourly granularity) benefits from models that can see at least one full seasonal cycle in their context window.
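FFT peak detection for the dominant period can be sketched as follows (an illustrative version, not the production code):

```python
import numpy as np

def dominant_period(x: np.ndarray) -> int:
    """Return the dominant seasonal period (in time steps) from the FFT peak."""
    psd = np.abs(np.fft.rfft(x - x.mean())) ** 2
    psd[0] = 0.0                      # ignore the DC component
    peak_bin = int(np.argmax(psd))
    return round(len(x) / peak_bin)   # frequency bin -> period in samples

t = np.arange(24 * 30)                # 30 days of hourly data
x = np.sin(2 * np.pi * t / 24) \
    + 0.1 * np.random.default_rng(1).standard_normal(len(t))
print(dominant_period(x))             # 24 -> daily seasonality
```

For the hourly example in the text, a detected period of 168 would steer the request toward models whose context window covers at least one full week.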

Series length relative to model context windows is a critical routing signal. If the input history is shorter than a model's minimum effective context (typically 2-3 seasonal cycles), that model is deprioritized. Conversely, if the series is very long and a model has a large context window (TimesFM supports up to 2048 tokens), it can leverage the additional history.

Multivariate dimension count routes high-dimensional inputs to models with native multivariate support (Moirai, iTransformer) rather than models that process each variate independently.

Missing value density identifies series with gaps. Models with native imputation capabilities (MOMENT) or robust masking strategies are preferred when missing data exceeds a threshold.
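Putting several of the cheaper descriptors together, a feature extractor along these lines could feed the router (field names here are illustrative assumptions, not the actual feature schema):

```python
import numpy as np

def routing_features(series: np.ndarray, n_variates: int = 1) -> dict:
    """Compact, model-free routing descriptors for one input series."""
    x = series[~np.isnan(series)]     # observed values only
    mean = x.mean()
    return {
        "length": len(series),                              # vs. context windows
        "n_variates": n_variates,                           # multivariate routing
        "missing_density": float(np.isnan(series).mean()),  # gap detection
        # Coefficient of variation: std / mean (guard near-zero means).
        "cv": float(x.std() / abs(mean)) if abs(mean) > 1e-9 else float("inf"),
    }

feats = routing_features(np.array([10.0, 12.0, np.nan, 11.0, 13.0]))
print(feats["missing_density"])  # 0.2
```

In practice these would be concatenated with the spectral features into the vector the classifier consumes.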

Training the Router

Building the routing classifier required a systematic evaluation campaign. We ran every supported model on a large evaluation suite spanning thousands of series from diverse domains: energy, retail, finance, weather, transportation, and healthcare. Each series was forecasted by every model, and we recorded per-series accuracy using MASE (Mean Absolute Scaled Error) and CRPS (Continuous Ranked Probability Score) for probabilistic models.

For each series, we labeled the best-performing model as the target class. (For more on the challenges of comparing TSFMs fairly, see our benchmarking deep dive.) This produced a labeled dataset of (feature vector, best model) pairs. We trained a gradient-boosted classifier (LightGBM) on this labeled data, using the extracted routing features as inputs.
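The labeling step reduces to an argmin over per-series scores. A toy sketch with made-up MASE values (the real evaluation spans thousands of series):

```python
import numpy as np

models = ["chronos", "moirai", "timesfm"]
# Hypothetical per-series MASE scores, shape (n_series, n_models):
mase = np.array([
    [0.82, 0.95, 1.10],
    [1.30, 0.88, 0.91],
    [0.70, 0.72, 0.69],
])
# Label each series with its best-performing (lowest-MASE) model.
labels = [models[i] for i in np.argmin(mase, axis=1)]
print(labels)  # ['chronos', 'moirai', 'timesfm']
# These (feature_vector, label) pairs then train the gradient-boosted
# classifier, e.g. lightgbm.LGBMClassifier().fit(features, labels).
```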

The classifier achieves roughly 72% top-1 accuracy in selecting the actual best model, and 91% top-2 accuracy. Even when the router misses the absolute best model, it rarely picks a poor one: in 91% of cases its choice is the best or second-best performer for that series, which keeps accuracy close to an oracle selection.

Results: Routing vs. Single Model

In our evaluation, the routed ensemble achieves an average 8-12% improvement in MASE over any single model used alone. The improvement is largest for heterogeneous workloads where input series span different domains and characteristics. For homogeneous workloads (e.g., all hourly energy data), the benefit is smaller (3-5%) because a single well-suited model already performs well.

The routed approach also improves tail performance. The worst-case forecast error (95th percentile MASE) drops significantly, because the router avoids assigning series to models that are known to struggle with their particular characteristics.

Fallback and Confidence

When the router's prediction confidence is low (maximum class probability below 0.4), the system defaults to a robust general-purpose model rather than making a low-confidence routing decision. Currently, this fallback is Chronos-Large, which provides consistently strong baseline performance across the widest range of input types.
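The fallback logic is simple to express. A sketch using the 0.4 threshold and Chronos-Large fallback described above (the surrounding serving code is assumed):

```python
FALLBACK = "chronos-large"
CONF_THRESHOLD = 0.4  # maximum class probability below this triggers fallback

def select_model(class_probs: dict[str, float]) -> tuple[str, float]:
    """Pick the router's top model, falling back when confidence is low."""
    best = max(class_probs, key=class_probs.get)
    conf = class_probs[best]
    if conf < CONF_THRESHOLD:
        return FALLBACK, conf
    return best, conf

print(select_model({"chronos": 0.55, "moirai": 0.30, "timesfm": 0.15}))
# ('chronos', 0.55)
print(select_model({"chronos": 0.35, "moirai": 0.34, "timesfm": 0.31}))
# ('chronos-large', 0.35)
```

Returning the confidence alongside the choice is what allows it to be surfaced in the API response metadata.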

The routing confidence is also exposed in the API response metadata, allowing users to understand whether their request was routed with high or low certainty. Users can override the router and specify a model directly if they have domain knowledge about which TSFM works best for their data.

Future Directions

The current routing approach uses a static classifier trained offline. We are exploring learned routing, where the router is jointly optimized with model inference through reinforcement learning on forecast accuracy feedback. We are also investigating dynamic ensembling, where instead of selecting a single model, the router assigns weights to multiple models and blends their predictions. Early experiments show that ensembling two or three models with data-dependent weights can push accuracy improvements to 15-18% over single-model baselines, at the cost of proportionally higher inference compute. For details on how we manage that compute budget, see Scaling TSFM Inference: GPU Optimization.
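The dynamic-ensembling idea amounts to a data-dependent convex combination of forecasts. A minimal sketch, assuming the router emits per-model weights (how those weights are learned is the open research question):

```python
import numpy as np

def blend_forecasts(forecasts: dict[str, np.ndarray],
                    weights: dict[str, float]) -> np.ndarray:
    """Weighted blend of per-model point forecasts; weights are normalized."""
    total = sum(weights.values())
    return sum(w / total * forecasts[name] for name, w in weights.items())

fcsts = {
    "chronos": np.array([10.0, 11.0]),
    "moirai":  np.array([12.0, 13.0]),
}
print(blend_forecasts(fcsts, {"chronos": 0.75, "moirai": 0.25}))  # [10.5 11.5]
```

Blending two or three models this way is what drives the 15-18% gains cited above, at the cost of running each weighted model's inference.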
