Fine-Tuning vs. Zero-Shot: When to Customize
Zero-shot TSFMs are powerful out of the box, but sometimes fine-tuning on your data delivers a meaningful accuracy boost. Here's how to decide.
Time series foundation models are trained on massive, diverse corpora spanning retail demand, energy consumption, financial indicators, weather observations, and web traffic. This broad pretraining gives them remarkable zero-shot generalization. You can feed a TSFM data it has never seen from a domain it has never encountered and get surprisingly accurate forecasts.
But "surprisingly accurate" is not always "accurate enough." The question practitioners face is concrete: does fine-tuning a foundation model on my domain-specific data improve accuracy enough to justify the cost? The answer depends on several factors that we can reason about systematically.
When Zero-Shot Is Sufficient
Zero-shot inference works well in specific, identifiable circumstances.
Standard temporal patterns. If your data exhibits common seasonality (daily, weekly, yearly), trend behavior, and noise structures that are well-represented in pretraining corpora, zero-shot models have already learned these patterns. Retail sales with weekly cycles, web traffic with daily patterns, and monthly business KPIs typically fall into this category.
Limited historical data. If you have fewer than 100 time steps of history per series, you do not have enough data to fine-tune without severe overfitting. Zero-shot inference is your best option, and it is often remarkably good in low-data regimes because the model draws on patterns learned from millions of other series.
Rapid prototyping and evaluation. When you are exploring whether TSFMs can solve your forecasting problem at all, zero-shot inference gives you an answer in minutes rather than days. Starting with zero-shot establishes a strong baseline and tells you whether the problem is tractable before you invest in customization.
Large, heterogeneous series collections. If you forecast thousands of diverse series (e.g., SKU-level demand across different product categories), fine-tuning a single model to handle all of them is difficult. Zero-shot models already handle heterogeneity well because they were pretrained on diverse data.
When Fine-Tuning Helps
Fine-tuning delivers measurable improvements in several well-defined scenarios.
Domain-specific temporal patterns. Hospital patient admissions follow weekly cycles that differ substantially from retail or energy patterns: Monday surges from weekend emergency deferrals, Wednesday surgical peaks, Friday discharge rushes. These domain-specific shapes are underrepresented in pretraining data. Fine-tuning on hospital admissions data lets the model learn these patterns explicitly.
Distribution shift from pretraining data. If your time series have statistical properties that are rare in typical pretraining corpora (unusual frequencies, non-standard scales, domain-specific anomaly patterns), the pretrained model's implicit priors may not match your data well. Financial tick data at millisecond resolution, sensor readings from specialized industrial equipment, and agricultural yield data with multi-year cycles are examples where the distribution shift can be significant. Models like Lag-Llama were specifically designed with distributional robustness in mind; see our Lag-Llama overview for details.
Need for maximum accuracy. In applications where a 3-5% improvement in MASE or WAPE translates to significant business value (high-stakes demand planning, energy trading, capacity provisioning), fine-tuning can close the gap between good zero-shot performance and the best achievable accuracy.
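To make that comparison concrete, here is a minimal sketch of the two metrics mentioned above, MASE and WAPE, implemented with NumPy. The function names and the seasonal-naive scaling period `m` are illustrative choices, not part of any specific library API.

```python
import numpy as np

def mase(y_true, y_pred, y_train, m=1):
    """Mean Absolute Scaled Error: forecast MAE divided by the in-sample
    MAE of a seasonal-naive forecast with period m."""
    y_true, y_pred, y_train = map(np.asarray, (y_true, y_pred, y_train))
    naive_mae = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(y_true - y_pred)) / naive_mae

def wape(y_true, y_pred):
    """Weighted Absolute Percentage Error: total absolute error divided
    by total absolute actuals."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true))
```

A 3-5% relative reduction in either metric on a held-out window is the kind of gap fine-tuning is meant to close.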
Fine-Tuning Approaches
Not all fine-tuning is created equal. The approaches vary in cost, risk, and expected improvement.
Full fine-tuning updates all model parameters on your target data. This gives the model maximum flexibility to adapt but carries the highest risk of catastrophic forgetting, where the model loses its general forecasting abilities while memorizing your specific dataset. Full fine-tuning requires careful regularization, learning rate scheduling, and early stopping. It is rarely the right choice unless you have a very large target dataset (thousands of series, hundreds of time steps each).
Parameter-efficient fine-tuning (PEFT) methods like LoRA freeze most of the pretrained weights and train a small number of additional parameters (typically 1-5% of the original model size). This preserves the model's general knowledge while allowing domain-specific adaptation. LoRA has become the default fine-tuning approach for TSFMs, just as it has for large language models. The risk of catastrophic forgetting is much lower, and training is faster and cheaper.
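The arithmetic behind LoRA's parameter savings is easy to see in a toy sketch. The sizes below are illustrative, not those of any real TSFM: a frozen weight matrix W is augmented with a trainable low-rank update B @ A, and B is initialized to zero so the adapted layer starts out identical to the pretrained one.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 64, 64, 4  # toy sizes; rank << d in practice

# Frozen pretrained weight (stands in for one attention/MLP matrix).
W = rng.normal(size=(d_out, d_in))

# Trainable low-rank factors. B starts at zero, so at initialization
# the adapted layer reproduces the pretrained layer exactly.
A = rng.normal(scale=0.01, size=(rank, d_in))
B = np.zeros((d_out, rank))

def adapted_forward(x, scale=1.0):
    """y = (W + scale * B @ A) x — only A and B would receive gradients."""
    return W @ x + scale * (B @ (A @ x))

x = rng.normal(size=d_in)
assert np.allclose(adapted_forward(x), W @ x)  # identical at init

# Trainable fraction: rank*(d_in + d_out) parameters vs d_in*d_out.
frac = rank * (d_in + d_out) / (d_in * d_out)
print(f"trainable fraction: {frac:.1%}")
```

At these toy dimensions the trainable fraction is 12.5%; at realistic model widths the same formula yields the 1-5% range cited above.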
In-context learning provides example series as context alongside the target series, similar to few-shot prompting in language models. This requires no weight updates at all. Models with long context windows like TimesFM and Moirai can accept additional series as conditioning context. The accuracy improvement is modest compared to PEFT but the operational simplicity is appealing: you do not need a training pipeline at all.
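How example series are supplied as context is model-specific; the helper below is a hypothetical sketch of one simple scheme, concatenating representative examples ahead of the target history and truncating to a fixed context window. Real TSFM APIs condition on examples differently, so treat this as an illustration of the idea rather than any model's interface.

```python
import numpy as np

def build_context(target_history, example_series, max_context=512):
    """Prepend representative example series to the target history,
    keeping only the most recent max_context points.
    Hypothetical conditioning scheme for illustration only."""
    parts = [np.asarray(s, dtype=float) for s in example_series]
    parts.append(np.asarray(target_history, dtype=float))
    context = np.concatenate(parts)
    return context[-max_context:]
```

Because no weights change, swapping in different examples is as cheap as rebuilding this array.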
Practical Considerations
How much data do you need? For PEFT fine-tuning, we typically see good results with 100 to 1,000 series and at least 200 time steps per series. Below 100 series, overfitting risk increases sharply even with LoRA. Well above 1,000 series, the returns from fine-tuning tend to diminish: a large, diverse collection is exactly the regime where zero-shot inference and in-context learning already generalize well, so you may not need fine-tuning at all.
Training cost. Fine-tuning Chronos-Large with LoRA on 500 series takes roughly 30 minutes on a single A10G GPU. Moirai-Large takes slightly longer due to its variate attention mechanism. The compute cost is modest. The real cost is engineering time: building the data pipeline, running evaluation experiments, and maintaining the fine-tuned model over time as your data distribution evolves.
Catastrophic forgetting. Always evaluate your fine-tuned model on both your target domain and a general benchmark (like a subset of Monash). If general accuracy drops by more than 10%, your fine-tuning is too aggressive. Reduce the learning rate, reduce the LoRA rank, or use fewer training epochs.
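That dual evaluation can be wrapped in a small gate. The function below is a sketch of the check described above; the 10% threshold follows the text, and the inputs are assumed to be lower-is-better error metrics (MASE or WAPE) measured for the zero-shot and fine-tuned models.

```python
def forgetting_check(domain_err_zs, domain_err_ft,
                     general_err_zs, general_err_ft,
                     max_general_degradation=0.10):
    """Compare fine-tuned vs zero-shot error on the target domain and a
    general benchmark. Flags fine-tuning that buys domain accuracy at
    the cost of more than the allowed general-benchmark degradation."""
    domain_improvement = (domain_err_zs - domain_err_ft) / domain_err_zs
    general_degradation = (general_err_ft - general_err_zs) / general_err_zs
    return {
        "domain_improvement": domain_improvement,
        "general_degradation": general_degradation,
        "acceptable": general_degradation <= max_general_degradation,
    }
```

A model that improves the domain by 10% but degrades the general benchmark by 15% would fail this gate and signal that the learning rate, LoRA rank, or epoch count should come down.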
Evaluation methodology. Compare fine-tuned versus zero-shot on a held-out test set from your domain. Use temporal splitting (train on earlier data, test on later data) rather than random splitting, since random splits leak temporal information and inflate accuracy estimates.
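A temporal split is simple to implement correctly: hold out a fixed horizon from the end of every series, so the test window is strictly later than anything the model trains on. A minimal sketch:

```python
def temporal_split(series, horizon):
    """Split each series into (history, future) at a fixed horizon from
    the end. Training data is strictly earlier than test data, so no
    temporal information leaks into the accuracy estimate."""
    train, test = {}, {}
    for name, values in series.items():
        if len(values) <= horizon:
            raise ValueError(f"series {name!r} is shorter than the horizon")
        train[name] = values[:-horizon]
        test[name] = values[-horizon:]
    return train, test
```

Run both the zero-shot and fine-tuned models on the same held-out windows so the comparison is apples to apples.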
A Decision Framework
Start with zero-shot inference on your actual data. Measure accuracy against your business requirements. If zero-shot meets your threshold, stop. You save significant engineering overhead and avoid ongoing model maintenance.
If zero-shot falls short, try in-context learning with representative example series. This requires no training and can close a portion of the gap.
If in-context learning is still insufficient, run a PEFT fine-tuning experiment. Compare the fine-tuned model against zero-shot on both your domain test set and a general benchmark. If the domain improvement is meaningful and general performance is preserved, deploy the fine-tuned model.
Only consider full fine-tuning if PEFT does not close the gap and you have a large, high-quality target dataset with a clear business case for the marginal improvement.
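The framework above can be condensed into a small decision helper. This is a sketch, not a prescription: the thresholds (100 series, 200 steps) come from the data-size guidance earlier in this post, and the function names and parameters are illustrative.

```python
def recommend_approach(n_series, steps_per_series,
                       zero_shot_meets_target,
                       icl_meets_target=False):
    """Walk the decision framework in order: zero-shot first, then
    in-context learning, then PEFT if the data supports it."""
    if zero_shot_meets_target:
        return "zero-shot"                      # stop: no customization needed
    if icl_meets_target:
        return "in-context learning"            # no training pipeline required
    if n_series >= 100 and steps_per_series >= 200:
        return "PEFT fine-tuning (e.g. LoRA)"   # enough data to adapt safely
    return "zero-shot (too little data to fine-tune without overfitting)"
```

Full fine-tuning is deliberately absent from this sketch: per the framework, it only enters the picture after a PEFT experiment has been run and found insufficient.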
Managed Fine-Tuning on TSFM.ai
TSFM.ai offers managed fine-tuning through our API. Try zero-shot inference first in the Playground; if you decide to customize, you upload your training data, specify the base model, and we handle the LoRA configuration, training, evaluation, and deployment. The fine-tuned model appears in your model catalog and can be used through the same inference API as any other model. This removes the infrastructure burden while giving you the accuracy benefits of domain adaptation. Reach out to our team if you want to explore fine-tuning for your use case.