Poster in Affinity Event: GlobalSouthML @ ICML 2026

Semantics or Structure? Auditing Text Sensitivity in Multimodal Time-Series Forecasting

Karthik Sridhar ⋅ Atharva Gupta ⋅ Nishant Pradhan ⋅ Murari Mandal ⋅ Dhruv Kumar ⋅ Saurabh Deshpande


Abstract

Multimodal time-series forecasting is a promising paradigm in which natural-language text is expected to improve forecasting accuracy. The multimodal foundation model Aurora and the late- and early-fusion approaches MMTSFlib and TaTS all report significant improvements over unimodal baselines on the Time-MMD benchmark and attribute these gains to the text. Whether these models are sensitive to the content of the text they receive has not been tested directly. We answer this question through a controlled text perturbation study, complemented by an attribution analysis of a numeric column shipped alongside the text, gradient and attention probes of Aurora's text pathway, and dataset-level structural diagnostics. On Time-MMD, replacing each row's text with an alternative (empty, constant, within-domain shuffled, or cross-domain text) moves mean MSE by less than 0.5% on all three architectures. The improvement reported in the literature is recovered when the co-shipped numeric column is removed and the text is left untouched. We conclude that, on this benchmark and within this family of frozen-encoder architectures, text content is not the operative signal behind the reported gains. We hope these findings inform the design of future multimodal foundation models for structured data.
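The four text perturbations named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names `perturb_texts` and `relative_mse_change`, the placeholder string used for the constant condition, the per-row `domains` labels, and the reading of "moves mean MSE" as a relative percentage change are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def perturb_texts(texts, domains, mode, seed=0, constant="N/A"):
    """Return a perturbed copy of `texts` (one string per row).

    Hypothetical helper: `domains[i]` is the domain label of row i, and
    `constant` is an arbitrary placeholder string (an assumption).
    """
    rng = random.Random(seed)
    if mode == "empty":
        return ["" for _ in texts]
    if mode == "constant":
        return [constant for _ in texts]
    if mode == "within_domain_shuffle":
        # Permute texts among rows that share a domain label.
        by_domain = defaultdict(list)
        for i, d in enumerate(domains):
            by_domain[d].append(i)
        out = list(texts)
        for idx in by_domain.values():
            shuffled = idx[:]
            rng.shuffle(shuffled)
            for dst, src in zip(idx, shuffled):
                out[dst] = texts[src]
        return out
    if mode == "cross_domain":
        # Give each row a text drawn from a row in a different domain.
        out = []
        for d in domains:
            donors = [i for i, other in enumerate(domains) if other != d]
            if not donors:
                raise ValueError("cross_domain needs at least two domains")
            out.append(texts[rng.choice(donors)])
        return out
    raise ValueError(f"unknown mode: {mode}")


def relative_mse_change(mse_original, mse_perturbed):
    """Percentage change in MSE relative to the unperturbed run."""
    return 100.0 * (mse_perturbed - mse_original) / mse_original
```

In use, a forecaster would be evaluated once with the original texts and once per perturbation mode, and `relative_mse_change` would be compared against a tolerance such as the 0.5% figure quoted above. The model evaluation itself is left out because the abstract does not specify it.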