CausalX: A Unified and Causally-Interpretable Plug-and-Play Model for Multi-modal Spatio-Temporal Forecasting
Abstract
Multi-modal spatio-temporal forecasting underpins many real-world applications but remains challenging due to the complex and evolving interactions across modalities and time steps. Moreover, the lack of interpretability in existing models limits their reliability in safety-critical scenarios. In this paper, we present CausalX, a unified and causally interpretable plug-and-play model for multi-modal spatio-temporal forecasting. CausalX achieves interpretability by learning a dynamic causal graph across modalities and time, whose edge weights quantify causal attribution strength and are further refined by a diffusion-based generative process guided by structural priors. To overcome the absence of ground-truth causal structures, CausalX aggregates multi-source constraints from causal analysis techniques and a variational autoencoder, spanning predictive, temporal, interventional, and generative aspects, to jointly learn a more comprehensive causal graph. Extensive experiments on real-world forecasting tasks, including pedestrian trajectory prediction and tropical cyclone forecasting, demonstrate that CausalX achieves superior accuracy while producing interpretable causal graphs. CausalX is modular, architecture-agnostic, and generalizable, offering a new perspective on bridging causal inference and spatio-temporal forecasting.