Synthetic Training Data for TSFMs: KernelSynth and Gaussian Process Augmentation
How KernelSynth uses Gaussian process priors with composed kernels to generate synthetic time series, and why roughly half of Chronos's training data is artificially generated.
Training a time series foundation model requires large volumes of diverse temporal data, but the available supply of public time series datasets is orders of magnitude smaller than what exists for language or vision. The largest curated time series corpora contain tens of billions of observations — substantial, but nowhere near the trillions of tokens that underpin modern LLMs. This data bottleneck is the central constraint shaping how TSFMs are trained, and it has pushed researchers toward a practical solution: generating synthetic time series at scale.
The Data Bottleneck
Public time series datasets are scattered across domains and formats. The Monash Time Series Forecasting Repository, one of the most comprehensive collections, contains roughly 30 benchmark datasets spanning electricity, traffic, weather, retail, and finance. Other sources like the UCR/UEA archive focus on classification rather than forecasting. When aggregated, these collections provide meaningful coverage but fall short of the scale and diversity needed to train a general-purpose foundation model.
Compare this to NLP, where Common Crawl alone provides hundreds of terabytes of text, or to computer vision, where ImageNet and LAION supply hundreds of millions of labeled images. Time series data is inherently harder to collect at scale: it is often proprietary, domain-specific, irregularly sampled, and governed by privacy or commercial restrictions. This asymmetry forces TSFM researchers to be creative about where their training data comes from.
KernelSynth: Gaussian Process Priors as a Data Factory
KernelSynth, introduced as part of the Chronos project by Amazon, addresses the data bottleneck by generating synthetic time series from Gaussian process (GP) priors. The core idea is elegant: a Gaussian process defines a distribution over functions, and by sampling from that distribution with different kernel configurations, you can produce an essentially unlimited supply of time series exhibiting different temporal behaviors.
The procedure works as follows. A set of base kernel functions is defined — typically including the radial basis function (RBF) kernel for smooth variation, a periodic kernel for seasonality, a linear kernel for trends, and a white noise kernel for observation noise. These base kernels are then randomly composed through addition and multiplication to form composite kernels. Each composite kernel defines a different GP prior, and sampling from that prior produces a time series with the corresponding temporal structure.
For example, an RBF kernel plus a periodic kernel produces a time series with smooth underlying variation overlaid with regular seasonal oscillations. Multiplying a linear kernel with a periodic kernel produces seasonality whose amplitude grows or shrinks over time. Adding white noise to any combination introduces realistic observation noise. The compositional nature of kernel algebra means that a relatively small set of base kernels can generate a combinatorially large space of temporal patterns. For a thorough treatment of how kernel composition works, see Rasmussen and Williams.
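The procedure can be sketched in a few lines of NumPy. This is a minimal illustration, not the Chronos implementation: the kernel forms, hyperparameter values, and the random composition rule here are assumptions chosen for readability.

```python
import numpy as np

# Base kernels, each returning an n x n covariance matrix over time points t.
def rbf(t, length=10.0):          # smooth local variation
    d = t[:, None] - t[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def periodic(t, period=24.0, length=1.0):  # seasonality
    d = np.abs(t[:, None] - t[None, :])
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / length ** 2)

def linear(t, c=0.0):             # trend
    return np.outer(t - c, t - c) / (t.max() ** 2)

def white(t, sigma=0.1):          # observation noise
    return sigma ** 2 * np.eye(len(t))

BASE = [rbf, periodic, linear, white]

def sample_series(n=256, max_extra=2, rng=None):
    """Randomly compose base kernels by + or *, then sample one GP draw."""
    rng = rng or np.random.default_rng()
    t = np.arange(n, dtype=float)
    K = BASE[rng.integers(len(BASE))](t)
    for _ in range(rng.integers(0, max_extra + 1)):
        k = BASE[rng.integers(len(BASE))](t)
        K = K + k if rng.random() < 0.5 else K * k   # sums and products of PSD kernels stay PSD
    K += 1e-6 * np.eye(n)   # jitter for numerical stability
    return rng.multivariate_normal(np.zeros(n), K)

series = sample_series(rng=np.random.default_rng(0))
print(series.shape)  # (256,)
```

Each call draws a fresh composite kernel and a fresh function from the corresponding GP prior, so repeated calls yield an unbounded stream of structurally varied series.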
Why GP-Based Synthesis Works
The strength of GP-based synthesis lies in its coverage of the space of plausible real-world time series. Trends, seasonality, smooth variation, abrupt level shifts (via changepoint kernels), and noise are the fundamental building blocks of most temporal phenomena. By systematically combining kernels that express each of these primitives, KernelSynth generates training data that spans the statistical structures a model will encounter in practice — without requiring access to actual datasets from each domain.
This is not an accident. Gaussian processes have a long history as flexible priors in Bayesian time series modeling precisely because kernel composition can approximate a wide range of covariance structures. KernelSynth repurposes this mathematical property for data augmentation rather than inference.
How Chronos Uses Synthetic Data
In the Chronos training pipeline, roughly half of all training data comes from KernelSynth-generated synthetic series. This is a striking ratio: a model that achieves strong zero-shot generalization across dozens of held-out benchmarks was trained on data that is, in large part, artificially generated. The synthetic data serves as a regularizer, exposing the model to temporal patterns that may be underrepresented or absent in the available real-world corpora.
Beyond KernelSynth, Chronos also employs TSMix, an augmentation strategy that creates new training examples by mixing and warping existing real time series. TSMix applies convex combinations and temporal distortions to pairs of real series, producing augmented examples that preserve realistic statistical properties while increasing effective dataset size. Together, KernelSynth and TSMix form a two-pronged augmentation pipeline — one generating entirely novel patterns from GP priors, the other expanding the real data distribution through controlled perturbation.
Chronos-Bolt, the faster second-generation Chronos model, continued to rely on this synthetic data pipeline, an indication that the approach remains effective as the architecture evolves.
The Debate: Synthetic Biases and Limitations
Synthetic data is not without risks. Models trained heavily on GP-generated series may internalize smoothness assumptions that do not hold for all real-world data. Financial tick data, IoT sensor readings with sharp spikes, and intermittent demand series all exhibit behaviors that standard GP kernels struggle to express. If a model's implicit prior is too smooth, it may underestimate tail risk or fail to capture discontinuities in domains where those patterns matter most.
Training data composition also affects benchmark fairness. If synthetic data happens to cover the statistical signatures of certain benchmark datasets well, models trained on it may appear to generalize better than they actually do on truly out-of-distribution problems.
Contrasting Approaches: LOTSA and Time-300B
Not all TSFMs rely on synthetic augmentation. Moirai from Salesforce was pretrained on LOTSA (Large-scale Open Time Series Archive), a curated collection of approximately 27 billion real observations from nine domains. LOTSA uses no synthetic data — its scale comes entirely from aggregating and cleaning public datasets. The trade-off is a smaller and potentially less diverse training corpus, but one that is guaranteed to reflect real-world statistical properties.
Time-MoE takes yet another path, assembling Time-300B — approximately 300 billion real time points — and emphasizing sheer scale of real data over synthetic augmentation. With enough real data volume and domain coverage, the argument goes, synthetic augmentation becomes less necessary.
The Emerging Consensus
The evidence from recent TSFM development points toward a practical consensus: synthetic and real data together tend to outperform either source alone. Synthetic data provides coverage of the pattern space that sparse real datasets cannot, while real data anchors the model to the statistical properties of actual forecasting problems. The optimal ratio and synthesis strategy likely depend on the target domain and available real data volume, but the principle of complementary data sources appears robust.
For practitioners choosing between models, understanding the training data composition helps set expectations. A model trained with heavy synthetic augmentation may excel at smooth, well-structured series but struggle with noisy or discontinuous data. A model trained exclusively on real data may lack coverage of rare temporal patterns. Evaluating both on your specific data — rather than relying on aggregate benchmarks — remains the most reliable approach. Explore the full range of available models in the TSFM.ai model catalog.