Semantics or Structure? Auditing Text Sensitivity in Multimodal Time-Series Forecasting
Karthik Sridhar ⋅ Atharva Gupta ⋅ Nishant Pradhan ⋅ Murari Mandal ⋅ Dhruv Kumar ⋅ Saurabh Deshpande
Abstract
Multimodal time-series forecasting is a promising paradigm in which natural-language text is expected to improve forecasting accuracy. The multimodal foundation model Aurora and the late- and early-fusion paradigms MMTSFlib and TaTS all report significant improvements over unimodal baselines on the Time-MMD benchmark, and attribute these gains to the text. Whether these models are sensitive to the \emph{content} of the text they receive has not been tested directly. We answer this question through a controlled text perturbation study, complemented by an attribution analysis of a numeric column shipped alongside the text, gradient and attention probes of Aurora's text pathway, and dataset-level structural diagnostics. On Time-MMD, replacing each row's text with a substitute (empty, constant, within-domain shuffled, or cross-domain) moves mean MSE by less than $0.5\%$ on all three architectures. The improvement reported in the literature is recovered when a co-shipped numeric column is removed without touching the text. We conclude that, on this benchmark and within this family of frozen-encoder architectures, text content is not the operative signal behind the reported gains. We hope these findings inform the design of future multimodal foundation models for structured data.