Case Study: Energy Demand Forecasting
How a European energy utility used TSFM.ai to improve demand forecast accuracy by 23% compared to their existing gradient boosted tree pipeline.
Accurate demand forecasting is the foundation of energy grid operations. Overestimate demand and you waste money on unnecessary generation capacity. Underestimate it and you risk blackouts or expensive spot-market purchases. For a mid-size European energy utility operating across 150 grid zones, getting these forecasts right is worth millions of euros per year.
This is the story of how that utility replaced a complex ensemble of gradient boosted tree models with zero-shot inference from time series foundation models, improving forecast accuracy by 23% while cutting engineering overhead dramatically.
The Existing Pipeline
The utility's forecasting team had built their system over four years. At its core sat a collection of LightGBM models, one per grid zone, each trained on that zone's historical demand data. Feature engineering was extensive: temperature forecasts from three weather providers (averaged to reduce bias), day-of-week and hour-of-day encodings, national and regional holiday calendars, sunrise and sunset times, and lagged demand values at 1-hour, 24-hour, and 168-hour (one week) offsets.
The system worked reasonably well for established zones with years of history. But it had real problems.
Maintenance burden. Each zone's model was retrained monthly on a rolling 18-month window. With 150 zones, that meant managing 150 separate training jobs, monitoring for data quality issues, and debugging when individual models degraded. Two full-time engineers spent most of their time on pipeline maintenance rather than improving accuracy.
Cold-start failures. When the utility expanded into new regions or restructured grid zones, the new zones had no historical data. The team's workaround was to clone a "similar" zone's model and manually adjust, but similarity was defined by gut feel rather than any systematic method. Forecast quality for new zones was poor for the first six to twelve months.
Onboarding latency. Adding a new region required building a feature pipeline for that region's weather data, sourcing the correct holiday calendar, and waiting for enough data to train a stable model. The end-to-end process took roughly three weeks per zone.
Evaluating TSFMs
The forecasting team ran a structured evaluation over six months of held-out data (January through June 2024) across all 150 zones. They compared three foundation models available through TSFM.ai against their production LightGBM ensemble.
The evaluation used MASE (Mean Absolute Scaled Error) as the primary metric, since it normalizes across zones with different demand magnitudes and is the standard metric from the Monash forecasting benchmark. They also tracked WAPE (Weighted Absolute Percentage Error) and coverage of 80% prediction intervals.
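For reference, MASE scales the forecast's mean absolute error by the in-sample MAE of a seasonal-naive baseline, so a value below 1.0 means the model beats naively repeating the last seasonal cycle. A minimal sketch (assuming a seasonal-naive baseline with period m, the convention used by the Monash benchmark):

```python
def mase(y_true, y_pred, y_train, m=24):
    # Forecast MAE over the evaluation window.
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    # In-sample MAE of a seasonal-naive forecast: predict the value
    # observed m steps earlier (m=24 for daily seasonality in hourly data).
    naive_mae = sum(abs(y_train[i] - y_train[i - m])
                    for i in range(m, len(y_train))) / (len(y_train) - m)
    return mae / naive_mae
```

Because the scaling baseline is computed per zone, MASE values can be averaged across zones with very different demand magnitudes, which is why the team used it as the primary metric.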
Chronos-Large achieved a mean MASE of 0.81 across all zones, a 23% improvement over LightGBM's 1.05. It performed strongest on zones with regular weekly seasonality and moderate trend, which described roughly 60% of the portfolio.
TimesFM came in at 0.87 MASE, a 17% improvement. It handled high-frequency intraday patterns slightly better than Chronos but was less robust on zones with strong trend components.
Moirai-Large scored 0.84 MASE. Its multivariate capability was tested by feeding correlated neighboring zones as covariates, which improved accuracy on zones near industrial clusters where demand is spatially correlated.
All three foundation models outperformed LightGBM on new zones with less than three months of history, where the accuracy gap widened to over 30%.
Deployment
The utility integrated TSFM.ai through the REST API, replacing their inference pipeline while keeping their existing data ingestion and monitoring infrastructure. The key architectural decisions:
Forecast refresh cadence. Hourly demand forecasts are generated every six hours with a 48-hour prediction horizon. Each refresh takes roughly 12 seconds for all 150 zones using batched inference through the Chronos-Large endpoint.
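Batching all 150 zones into one request is what keeps the refresh fast. A sketch of how such a batched request body might be assembled; the endpoint URL and payload field names here are illustrative assumptions, not TSFM.ai's documented API:

```python
# Hypothetical payload shape: field names ("model", "horizon", "series")
# and the endpoint URL are assumptions for illustration only.
def build_forecast_request(zone_histories, horizon=48, model="chronos-large"):
    # Package every zone's recent demand history into a single request
    # so all 150 zones are forecast in one batched inference call.
    return {
        "model": model,
        "horizon": horizon,
        "series": [
            {"id": zone_id, "values": history}
            for zone_id, history in zone_histories.items()
        ],
    }

payload = build_forecast_request({"zone-001": [410.2, 398.7, 385.1],
                                  "zone-002": [122.0, 119.5, 121.3]})
# resp = requests.post("https://api.tsfm.ai/v1/forecast", json=payload)  # illustrative URL
```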
Model selection. Following the evaluation, the team adopted Chronos-Large as the default for most zones, with Moirai-Large for the 30 zones where spatial correlation proved beneficial. TSFM.ai's model routing handles this selection based on zone metadata tags.
Monitoring. Forecast accuracy is tracked daily against realized demand. The team set up alerts for any zone where rolling 7-day MASE exceeds 1.2, which triggers a review. In six months of production use, this alert has fired for only four zones, all during unusual weather events.
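The alert rule itself is simple to state: compute MASE over the trailing week of hourly errors and compare against the 1.2 threshold. A minimal sketch, assuming the per-zone seasonal-naive MAE used for scaling has already been computed:

```python
def mase_alert(abs_errors, naive_mae, window=7 * 24, threshold=1.2):
    # abs_errors: chronological list of hourly absolute forecast errors
    # for one zone; naive_mae: that zone's seasonal-naive scaling MAE.
    recent = abs_errors[-window:]                  # trailing 7 days of hours
    rolling_mase = (sum(recent) / len(recent)) / naive_mae
    return rolling_mase > threshold                # True triggers a review
```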
Prediction intervals. The 80% prediction intervals from Chronos achieved 82% empirical coverage, slightly conservative, which the operations team prefers over overconfident intervals that miss extremes.
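Empirical coverage like the 82% figure above is just the fraction of realized demand values that fall inside the predicted interval. A minimal way to compute it:

```python
def interval_coverage(y_true, lower, upper):
    # Fraction of observations landing inside their prediction interval;
    # for a well-calibrated 80% interval this should be close to 0.80.
    hits = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return hits / len(y_true)
```

Coverage slightly above the nominal level, as seen here, indicates mildly conservative intervals; coverage well below it would signal overconfidence.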
Results
After six months in production, the impact is measurable across three dimensions.
Forecast accuracy improved by 23% on average, with the largest gains in zones that previously had the least historical data. Engineering time dedicated to model maintenance dropped by roughly 40%, freeing one engineer to work on adjacent problems like renewable generation forecasting. New zone onboarding went from three weeks to under four hours, limited mainly by setting up the data ingestion pipeline rather than any model training.
Lessons Learned
The team identified several insights that may generalize to other forecasting deployments. First, feature engineering is not free. The LightGBM pipeline's weather and calendar features cost significant engineering effort, and the foundation models matched or exceeded that accuracy without any exogenous features. Second, zero-shot does not mean zero-effort. Data quality, particularly handling missing values and timezone alignment, still matters and consumed the bulk of the integration work. Third, the ensemble of specialized models can be outperformed by a single general-purpose foundation model when that model has been pretrained on sufficiently diverse data.
The utility plans to extend their TSFM.ai usage to renewable generation forecasting and is evaluating fine-tuning Chronos on their historical demand data to see if domain-specific adaptation can close the remaining gap to theoretical forecast limits. You can experiment with similar forecasting workflows in the Playground.