Multimodal Time Series: Combining Numerical Data with Text and Natural Language Context
What if forecast models could read a description of your data alongside the numbers? Multimodal time series research is exploring how natural language context — domain descriptions, known events, business constraints — can improve forecasts, especially in zero-shot settings.
A grocery store in the rural Midwest sees a sharp sales spike every November. A wind farm on the North Sea coast produces erratic output whenever Atlantic storms shift north. An e-commerce platform's conversion rate drops predictably during back-to-school season, then surges on Black Friday. In each case, the numerical series alone tells part of the story, but a single sentence of context would tell a model exactly what to expect. The emerging field of multimodal time series forecasting asks a straightforward question: what happens when models can consume text alongside numbers?
Today's time series foundation models treat every input series as an unlabeled sequence of floating-point values. A model receiving daily retail sales data has no way to know whether those numbers represent grocery revenue in Wisconsin or hotel bookings in Tokyo. It cannot know that a holiday is approaching, that a competitor just opened nearby, or that the series was collected from a sensor with known calibration drift. All of that context lives in the forecaster's head, not in the model's input. Multimodal approaches aim to close that gap.
Natural Language as Domain Context
The simplest form of multimodal input is a text description attached to the series. Where structured covariates encode fixed signals like temperature columns or holiday flags, a natural language prompt provides open-ended context: the data source, its domain, known upcoming events, business constraints, and expected behavior. A prompt like "daily electricity demand for a university campus in Texas, spring semester starting January 15" carries information that would be difficult to encode as a fixed set of numerical covariates but is trivially expressed in a sentence.
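As a concrete sketch, the pairing of a series with free-text context can be as simple as one extra string field traveling alongside the numbers. The schema below is hypothetical (no standard exists today); the field names are illustrative only.

```python
from dataclasses import dataclass

@dataclass
class ContextualSeries:
    """A numerical series paired with free-text context (hypothetical schema)."""
    values: list[float]
    freq: str          # pandas-style frequency string, e.g. "D" for daily
    description: str   # open-ended natural language context

series = ContextualSeries(
    values=[412.0, 398.5, 905.2, 421.7],
    freq="D",
    description=(
        "daily electricity demand for a university campus in Texas, "
        "spring semester starting January 15"
    ),
)
# A multimodal model (or a prompt builder in front of one) can now
# consume both the numbers and the sentence at inference time.
```

The point is that the text rides along with the payload rather than being flattened into numerical covariates up front.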
This matters most for zero-shot forecasting, where the model encounters a series it has never trained on. Without text context, a zero-shot model must infer domain, frequency, scale, and seasonality entirely from the numerical pattern. With even a brief description, the model gains access to prior knowledge about how grocery sales, wind generation, or web traffic typically behave. The potential accuracy improvement is largest precisely where foundation models struggle most: short input histories, ambiguous frequencies, and domains far from their pretraining distribution.
Three Research Approaches
The research community has explored three distinct strategies for combining text and time series, each reflecting a different philosophy about where language fits in the forecasting pipeline.
LLM reprogramming. Time-LLM keeps a large language model entirely frozen and learns a lightweight adapter that maps time series patches into the LLM's token embedding space. A natural language prompt is prepended to the reprogrammed sequence, steering the frozen model's attention toward domain-appropriate reasoning. The approach is detailed in our LLM reprogramming deep dive. Time-LLM demonstrated that text prompts measurably improve forecasts when the LLM backbone is large enough to have internalized domain knowledge during pretraining. The key insight is that projecting patches into the LLM's native embedding space, combined with a textual prefix, bridges the modality gap without retraining billions of parameters.
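The data flow above can be sketched in a few lines. This is a deliberately simplified stand-in: Time-LLM's actual reprogramming layer uses cross-attention over text prototypes, and the prompt embeddings come from the frozen LLM's tokenizer and embedding table. Here a plain linear map and random embeddings illustrate the shapes; the dimensions are assumed, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions for illustration: patch length 16, LLM embedding size 768.
PATCH_LEN, D_MODEL = 16, 768

def patchify(series: np.ndarray, patch_len: int = PATCH_LEN) -> np.ndarray:
    """Split a 1-D series into non-overlapping patches, padding the tail
    with the last observed value."""
    pad = (-len(series)) % patch_len
    padded = np.concatenate([series, np.full(pad, series[-1])])
    return padded.reshape(-1, patch_len)

# The learnable adapter, reduced here to a single linear projection.
W_adapter = rng.normal(scale=0.02, size=(PATCH_LEN, D_MODEL))

series = rng.normal(size=100)
patch_tokens = patchify(series) @ W_adapter        # (n_patches, d_model)

# Stand-in for the embedded natural-language prompt prefix.
prompt_tokens = rng.normal(size=(12, D_MODEL))

# Prompt prefix + reprogrammed patches form the frozen LLM's input;
# only the adapter (and an output head) would be trained.
llm_input = np.concatenate([prompt_tokens, patch_tokens], axis=0)
```

Because the LLM stays frozen, the trainable parameter count is tiny relative to the backbone, which is what makes the approach practical.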
Text-conditioned forecasting. UniTime and PromptCast take a more integrated approach. UniTime trains a unified model across multiple domains and uses natural language domain instructions as conditioning signals, allowing a single architecture to specialize its behavior based on a text description of the data source. PromptCast goes further, framing forecasting as a question-answering task where both the input series and the forecast are expressed in natural language. These models treat text not as a secondary signal but as a first-class input modality that shapes the forecast distribution directly.
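The question-answering framing can be made concrete with a prompt template. The function below is a toy rendering in the spirit of PromptCast; the exact templates in the paper differ, and the wording here is an assumption.

```python
def promptcast_query(values: list[float], unit: str = "visitors",
                     horizon: int = 1) -> str:
    """Render a forecasting task as a natural-language question
    (illustrative template, not PromptCast's exact wording)."""
    history = ", ".join(f"{v:g}" for v in values)
    return (
        f"The number of {unit} on each of the last {len(values)} days "
        f"was {history}. What will it be over the next {horizon} day(s)?"
    )

q = promptcast_query([120, 135, 128], unit="visitors", horizon=2)
# The question string is sent to a language model, and the numeric
# forecast is parsed back out of its text answer.
```

Both input and output living in natural language is what lets an off-the-shelf LLM act as the forecaster with no architectural changes.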
Retrieval-augmented forecasting. A third, less formalized approach retrieves relevant text metadata at inference time. Given an input series and a minimal description, a retrieval system pulls domain-specific context from a knowledge base — historical event logs, sensor documentation, or seasonal calendars — and feeds it to the model alongside the numerical data. This mirrors retrieval-augmented generation in LLMs and is particularly promising for enterprise settings where extensive domain documentation already exists but is not captured in any structured covariate format.
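A minimal sketch of this retrieval step, using word overlap as a toy stand-in for embedding-based similarity search; the knowledge-base entries and scoring are invented for illustration.

```python
def retrieve_context(description: str, knowledge_base: dict[str, str],
                     k: int = 2) -> list[str]:
    """Return the k knowledge-base entries sharing the most words with
    the series description (toy stand-in for vector retrieval)."""
    query = set(description.lower().split())
    scored = sorted(
        knowledge_base.items(),
        key=lambda kv: len(query & set(kv[1].lower().split())),
        reverse=True,
    )
    return [text for _, text in scored[:k]]

kb = {
    "calendar":  "Black Friday falls in late November and drives retail sales spikes.",
    "sensor":    "Meter 7 has known calibration drift above 40 C.",
    "logistics": "Warehouse relocation planned for Q3 may disrupt shipping series.",
}
context = retrieve_context("daily retail sales late November", kb)
# The retrieved snippets are concatenated with the numerical payload
# before being passed to the forecasting model.
```

A production system would replace the overlap score with dense embeddings and the dict with a vector store, but the pipeline shape (describe, retrieve, condition) is the same.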
Why This Matters Now
The timing is not accidental. Several trends are converging. First, foundation models have reached the point where zero-shot forecasting works well enough for production use, but accuracy gaps remain on out-of-distribution domains — exactly where text context would help most. Second, the broader TSFM landscape is moving toward models that accept richer inputs: structured covariates are already mainstream, and text is the next logical modality. Third, the success of multimodal LLMs in vision-language tasks has demonstrated that cross-modal conditioning works at scale, providing both architectural blueprints and engineering intuitions that transfer to the time series setting.
The practical implication is that domain expertise, which today lives in documentation, runbooks, and the heads of analysts, could become a direct model input. A forecasting API that accepts a context field alongside the numerical payload would let users communicate knowledge that no amount of historical data can provide: planned store closures, upcoming regulatory changes, or the fact that last year's anomaly was caused by a one-time data pipeline failure.
Current Limitations
Despite the clear trajectory, significant gaps remain. No widely adopted forecast API supports free-text context fields today. Benchmark datasets are almost exclusively numbers-only, making it difficult to evaluate multimodal approaches rigorously or compare them against numerical baselines under controlled conditions. The benchmarking challenges that already complicate TSFM evaluation become harder when one model receives text context and another does not.
There is also an open question about how much text actually helps versus how much it introduces noise. A vague or misleading description could steer the model in the wrong direction, and there is no established methodology for validating text-context quality the way there is for numerical covariates. Early results are encouraging but narrow, typically demonstrated on a handful of curated datasets where the text descriptions are carefully crafted by researchers.
Where TSFM.ai Is Heading
We see multimodal context as a natural extension of the platform. Our API already supports structured covariates, and the architecture is designed to accommodate additional input modalities. Future releases will introduce metadata fields for natural language task descriptions, domain tags, and event annotations that models can consume alongside numerical series. The goal is straightforward: let users pass the context they already have — a sentence, a paragraph, a set of known events — and let the model use it.
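To make the shape of such a request concrete, here is an illustrative payload combining a series with description, domain tags, and event annotations. This is a sketch only, not the published TSFM.ai API; every field name and value below is hypothetical.

```python
import json

# Illustrative request shape only -- field names are assumptions.
request = {
    "series": [1520.0, 1498.2, 1611.9, 2940.5],
    "freq": "D",
    "horizon": 14,
    "context": {
        "description": "daily grocery revenue, single store, rural Midwest",
        "domain_tags": ["retail", "grocery"],
        "events": [
            {"date": "2024-11-29", "name": "Black Friday"},
            {"date": "2024-12-02", "note": "planned store closure"},
        ],
    },
}

payload = json.dumps(request)  # serialized alongside the numerical data
```

The numerical fields are what forecast APIs accept today; the `context` object is the multimodal extension.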
This is early-stage research, and we are not overselling where the field stands today. But the trajectory is clear. The models that define the next generation of time series forecasting will not treat numbers in isolation. They will read the context, understand the domain, and forecast accordingly. You can explore the current generation of models, including those with LLM backbones, in our model catalog.