Toto: Datadog's Domain-Specific TSFM for Observability
Datadog's Toto is a 151M-parameter transformer trained on trillions of observability data points, purpose-built for forecasting infrastructure metrics like CPU utilization, error rates, and request latency.
Most time series foundation models are trained on general-purpose corpora: retail sales, electricity consumption, weather stations, economic indicators. These datasets produce models that generalize broadly, but they leave a gap when the target domain has statistical properties that diverge sharply from the training distribution. Observability metrics — the CPU utilization graphs, request rate counters, error tallies, and latency percentiles that define modern infrastructure monitoring — are exactly that kind of domain. In October 2025, Datadog released Toto, a time series foundation model built specifically for this problem.
Why Observability Metrics Are Different
Infrastructure telemetry has distinctive statistical characteristics that set it apart from the time series domains where most TSFMs are evaluated.
Irregular spikes and regime changes. A deployment rollout can shift the baseline of a latency metric in an instant. Garbage collection pauses produce sharp, periodic spikes in memory and CPU usage that are normal behavior, not anomalies. Auto-scaling events create abrupt step changes in request distribution across instances. These regime changes are not well-represented in retail or weather datasets, where transitions tend to be gradual or seasonal.
Strong diurnal and weekly seasonality with noisy overlays. Production traffic follows human activity patterns — clear day/night cycles, weekday/weekend differences — but layered on top of these predictable rhythms is high-frequency noise from individual requests, background jobs, and cron tasks. The signal-to-noise ratio is often much lower than in curated benchmark datasets.
Irregular and mixed sampling rates. Observability platforms ingest metrics at varying frequencies. Some infrastructure counters report every 10 seconds, others every 5 minutes. Agents may drop data points during high load, creating gaps. A model trained primarily on clean, regularly sampled academic datasets may struggle with this irregularity.
Multivariate correlations that matter. CPU utilization, memory pressure, request throughput, and error rate are deeply correlated in ways that reflect the physical and logical structure of the systems they measure. A spike in error rate concurrent with elevated p99 latency and normal throughput tells a different story than the same error spike alongside a throughput surge.
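A few lines of NumPy can mimic the properties described above: a diurnal baseline, high-frequency noise, a deployment-induced step change, and periodic GC spikes. All magnitudes here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
minutes = np.arange(7 * 24 * 60)  # one week at 1-minute resolution

# Diurnal cycle: load peaks mid-day, troughs overnight.
diurnal = 50 + 30 * np.sin(2 * np.pi * (minutes % 1440) / 1440 - np.pi / 2)

# High-frequency noise from individual requests and background jobs.
noise = rng.normal(0, 5, size=minutes.size)

# A deployment on day 3 shifts the baseline instantly (regime change).
regime = np.where(minutes >= 3 * 1440, 15.0, 0.0)

# Periodic GC pauses: sharp spikes every 2 hours that are normal, not anomalous.
gc_spikes = np.where(minutes % 120 == 0, 40.0, 0.0)

cpu_like = diurnal + noise + regime + gc_spikes
print(cpu_like.shape)  # (10080,)
```

A general-purpose benchmark series would typically contain the first component and little else; the step change and the spike train are exactly what makes observability data hard.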
Toto's Architecture
Toto is a 151M-parameter decoder-only transformer — the same architectural family that powers most modern large language models, applied here to continuous-valued time series. The architecture uses patch-based input embedding with a fixed patch size of 32, dividing the time dimension into patches that are linearly projected into the model's latent space before being processed by the transformer stack.
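A minimal sketch of this style of patch embedding in PyTorch. The patch size of 32 comes from the description above; the latent width (d_model=768) and everything else is an illustrative assumption, not Toto's actual code.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split each variate's time axis into fixed-size patches and
    linearly project each patch into the model's latent space."""

    def __init__(self, patch_size: int = 32, d_model: int = 768):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Linear(patch_size, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, variates, time), with time divisible by patch_size
        b, v, t = x.shape
        patches = x.reshape(b, v, t // self.patch_size, self.patch_size)
        return self.proj(patches)  # (batch, variates, num_patches, d_model)

emb = PatchEmbedding()
out = emb(torch.randn(2, 8, 512))  # 512 steps -> 16 patches per variate
print(out.shape)                   # torch.Size([2, 8, 16, 768])
```

Patching shortens the sequence the transformer must attend over by a factor of 32, which is what makes long observability contexts tractable.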
A key design choice is the factorized attention mechanism. Rather than treating all variates and all time steps in a single attention computation, Toto's transformer blocks alternate between time-wise attention (modeling temporal dependencies within each variate) and channel-wise attention (modeling cross-variate correlations). This factorization allows the model to efficiently handle multivariate observability series — where dozens of metrics from a single host are naturally correlated — without the quadratic scaling cost of full space-time attention.
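The factorization can be sketched by folding the non-attended axis into the batch dimension, so each attention call is quadratic only in the axis it attends over. This is an illustrative reconstruction, not Toto's implementation; all hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class FactorizedAttentionBlock(nn.Module):
    """Attend along one axis at a time: 'time' models temporal
    dependencies within each variate, 'channel' models cross-variate
    correlations. Cost is quadratic in one axis, not in their product."""

    def __init__(self, d_model: int = 64, n_heads: int = 4, axis: str = "time"):
        super().__init__()
        self.axis = axis
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, variates, patches, d_model)
        b, v, p, d = x.shape
        if self.axis == "time":
            # Fold variates into batch; attend across time patches.
            seq = x.reshape(b * v, p, d)
        else:
            # Fold patches into batch; attend across variates.
            seq = x.transpose(1, 2).reshape(b * p, v, d)
        out, _ = self.attn(seq, seq, seq)
        if self.axis == "time":
            return out.reshape(b, v, p, d)
        return out.reshape(b, p, v, d).transpose(1, 2)

x = torch.randn(2, 8, 16, 64)       # 8 correlated metrics, 16 patches
y = FactorizedAttentionBlock(axis="time")(x)
y = FactorizedAttentionBlock(axis="channel")(y)
print(y.shape)                      # torch.Size([2, 8, 16, 64])
```

Alternating these two block types, as the text describes, gives the model both temporal and cross-metric context at a fraction of the cost of full space-time attention.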
For output, Toto uses a Student-t mixture distribution head. This is a meaningful departure from models that use Gaussian mixtures or binned token distributions. The Student-t distribution's heavier tails make it naturally suited to observability data, where extreme values (latency spikes, traffic surges) are far more common than a Gaussian assumption would predict. The mixture formulation allows the model to represent multimodal forecast distributions, which arise naturally when a metric could either remain stable or undergo a regime shift. For background on probabilistic output approaches, see prediction intervals vs. point forecasts.
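PyTorch's distributions module makes the idea easy to sketch. All parameter values below are invented for illustration; they mimic a latency forecast (in ms) whose components disagree about a possible regime shift.

```python
import torch
from torch.distributions import Categorical, MixtureSameFamily, Normal, StudentT

# A Student-t mixture head emits, per forecast step, K components of
# (df, loc, scale) plus mixture weights. K=3 here is an arbitrary choice.
weights = Categorical(probs=torch.tensor([0.5, 0.3, 0.2]))
components = StudentT(
    df=torch.tensor([3.0, 5.0, 10.0]),      # low df -> heavier tails
    loc=torch.tensor([100.0, 100.0, 250.0]),  # third mode: a regime shift
    scale=torch.tensor([10.0, 30.0, 50.0]),
)
mix = MixtureSameFamily(weights, components)

# Moment-matched Gaussian for comparison: a 400 ms spike is far less
# surprising under the t-mixture than under a single Gaussian.
gauss = Normal(mix.mean, mix.variance.sqrt())
spike = torch.tensor(400.0)
print(mix.log_prob(spike), gauss.log_prob(spike))
```

Training such a head amounts to minimizing the negative log-likelihood of the observed values under this mixture, which is why the heavier tails translate directly into better-calibrated forecasts of extreme values.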
Training at Observability Scale
The defining characteristic of Toto's training is scale and domain composition. The model was pretrained on approximately 2.36 trillion time series data points, one of the largest training corpora used for any published TSFM. Critically, a large portion of this data consists of fully anonymized numerical metric data drawn from Datadog's observability platform — real CPU counters, real request rates, real error counts from production infrastructure around the world.
The remainder of the training mix includes publicly available time series datasets from the LOTSA archive and other standard sources, providing the model with exposure to non-infrastructure domains. But the heavy weighting toward actual observability telemetry is what distinguishes Toto from general-purpose alternatives like Chronos or Moirai, which rely on academic benchmarks, synthetic data, and curated public datasets.
This matters because distribution matters. A model learns the statistical patterns present in its training data. If that data is predominantly retail demand and weather forecasts, the model's learned representations will be optimized for those dynamics — smooth seasonal curves, holiday effects, gradual trends. Observability metrics require representations that capture instantaneous regime shifts, correlated multi-metric failures, and the characteristic sawtooth patterns of resource utilization under load.
Zero-Shot Performance on New Infrastructure
Like other modern TSFMs, Toto supports zero-shot forecasting — it can produce predictions on metrics it has never seen before, without any fine-tuning. This is particularly valuable in infrastructure contexts. Teams spin up new services, add new metrics, and provision new infrastructure constantly. Waiting to accumulate enough historical data to train or fine-tune a per-metric model is impractical when you need forecasts from day one.
Datadog's evaluations show that Toto achieves state-of-the-art accuracy on BOOM (Benchmark of Observability Metrics), a public benchmark containing 350 million observations across 2,807 real-world observability time series. On general-purpose benchmarks like GIFT-Eval, Toto also performs competitively, demonstrating that domain specialization does not come at the cost of general capability.
Practical Use Cases
Capacity planning. Forecasting CPU, memory, and disk utilization days or weeks ahead allows infrastructure teams to provision resources before saturation. Toto's understanding of infrastructure-specific patterns — growth curves under increasing load, periodic batch job impacts, weekend traffic drops — produces more accurate capacity forecasts than general-purpose models.
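As a sketch, turning a quantile forecast into a provisioning lead time takes only a few lines. The forecast values and the 85% threshold below are invented for illustration; any model that emits calibrated upper quantiles would slot in.

```python
import numpy as np

# Hypothetical upper-quantile (p90) forecast of disk usage (%) for the
# next 14 days, assuming steady growth in the upper band.
days = np.arange(1, 15)
p90 = 62.0 + 2.5 * days

threshold = 85.0  # provision before usage crosses 85%
above = p90 >= threshold
crossing = int(np.argmax(above)) if above.any() else None
print(f"provision within {days[crossing]} days")  # p90 crosses 85% on day 10
```

Using the upper quantile rather than the median builds a safety margin directly into the plan: capacity is added before the plausible worst case hits saturation, not the expected case.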
Alerting threshold optimization. Static alert thresholds generate noise. A p99 latency of 200ms might be normal during peak hours and deeply anomalous at 3 AM. By generating context-aware forecasts with calibrated prediction intervals, Toto enables dynamic thresholds that adapt to expected behavior. For more on this approach, see anomaly detection with TSFMs.
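A minimal sketch of the idea, with hand-rolled stand-ins for the model's median and upper-quantile forecasts: the same 190 ms observation trips the alert at 3 AM but not at 1 PM.

```python
import numpy as np

hours = np.arange(24)
# Stand-in for a forecast of expected p99 latency (ms): flat overnight,
# peaking mid-day. Both series are invented for illustration.
forecast_p50 = 120 + 80 * np.sin(2 * np.pi * (hours - 6) / 24).clip(min=0)
forecast_p99 = forecast_p50 + 40  # calibrated upper prediction band

observed = forecast_p50.copy()
observed[3] = 190.0    # 190 ms at 3 AM: well above the overnight band
observed[13] = 190.0   # same value at 1 PM: within the daytime band

alerts = observed > forecast_p99  # dynamic, context-aware threshold
print(np.flatnonzero(alerts))     # [3]
```

A static 200 ms threshold would have missed the 3 AM anomaly entirely; a static 150 ms threshold would have paged on normal daytime traffic. The forecast band resolves both failure modes.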
Cloud cost forecasting. Infrastructure spend is a function of resource consumption over time. Accurate forecasts of compute, storage, and network utilization translate directly into better cost projections, enabling finance and engineering teams to anticipate budget impact before it materializes.
Intelligent auto-scaling. Reactive auto-scaling responds to load after it arrives. Predictive auto-scaling, powered by short-horizon forecasts of request volume or resource consumption, can provision capacity in advance, reducing both latency spikes during scale-up and wasted spend during over-provisioning.
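A sketch of the sizing step, assuming a hypothetical per-replica capacity of 500 requests/sec and 20% headroom (both invented parameters):

```python
import math

def replicas_needed(forecast_rps: float,
                    per_replica_rps: float = 500.0,
                    headroom: float = 0.8) -> int:
    """Replicas needed so forecast load stays under `headroom` of capacity."""
    return max(1, math.ceil(forecast_rps / (per_replica_rps * headroom)))

# Scale up ahead of a forecast ramp from 1.2k to 4.5k requests/sec.
for rps in (1200.0, 2800.0, 4500.0):
    print(rps, replicas_needed(rps))  # 3, 7, 12 replicas respectively
```

In a real deployment this function would be driven by the forecast's upper quantile rather than a point estimate, so that scale-up happens early enough to cover the plausible worst case.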
Where Toto Fits on TSFM.ai
Toto represents a broader trend in the TSFM landscape: domain specialization. The first generation of foundation models aimed for universality — one model for all time series. Toto demonstrates that training on domain-specific data at scale can produce meaningfully better results within that domain while retaining competitive general-purpose performance. For teams running model routing strategies, Toto becomes the natural choice when the input is infrastructure telemetry, while general-purpose models handle other domains.
You can explore Toto alongside other foundation models in our model catalog, experiment with observability forecasting in the playground, or read about scaling inference for GPU-accelerated model serving to understand deployment considerations. For teams building production forecasting systems around observability data, see building production forecast pipelines.