Giving Sensors a Voice: Multimodal JEPA for Semantic Time-Series Embeddings
Abstract
Traditional time-series models are often task-specific and rely heavily on manual feature engineering. While Transformer-based architectures have revolutionized sequence modeling in language and vision, their potential for general-purpose time-series representation learning remains underexplored, particularly for heterogeneous sensor data. We introduce CHARM (Channel-Aware Representation Model), which improves multivariate time-series representations by incorporating channel-level textual descriptions into the architecture. This allows the model to exploit contextual information associated with individual sensors while remaining invariant to channel order. CHARM is trained with a Joint Embedding Predictive Architecture (JEPA) and a novel loss that promotes informative, temporally stable embeddings. We find that CHARM’s latent-space prediction encourages robustness to sensor-level noise and supports learning of underlying temporal structure. In addition, the description-aware gating mechanism provides a degree of interpretability through learned inter-channel relationships. Across a range of downstream tasks—including univariate and multivariate anomaly detection, classification, and short- and long-term forecasting—the learned embeddings achieve strong performance using only a lightweight linear probe.
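To make the abstract's core ideas concrete, the following is a minimal, hypothetical sketch (not the paper's actual architecture) of how channel-level description embeddings, a description-aware gating step, and a JEPA-style latent-space prediction loss could fit together. All names, shapes, and weight matrices here are illustrative assumptions; mean-pooling over channels is used only to demonstrate the claimed invariance to channel order.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_channels(series, desc_emb, W_s, W_d):
    # Combine each channel's series (C, T) with its text-description
    # embedding (C, D) into per-channel latent tokens (C, H).
    return np.tanh(series @ W_s + desc_emb @ W_d)

def gate_and_pool(h, W_g):
    # Description-aware gating: a learned scalar gate per channel,
    # followed by a mean over channels. The mean makes the pooled
    # embedding invariant to the order in which channels arrive.
    g = 1.0 / (1.0 + np.exp(-(h @ W_g)))   # (C, 1) gates
    return (g * h).mean(axis=0)            # (H,) pooled embedding

# Toy dimensions (hypothetical): C channels, T timesteps,
# D description-embedding size, H latent size.
C, T, D, H = 4, 16, 8, 32
W_s = rng.normal(size=(T, H)) * 0.1
W_d = rng.normal(size=(D, H)) * 0.1
W_g = rng.normal(size=(H, 1)) * 0.1
W_pred = rng.normal(size=(H, H)) * 0.1

context = rng.normal(size=(C, T))  # visible portion of the series
target = rng.normal(size=(C, T))   # masked / future portion
descs = rng.normal(size=(C, D))    # frozen text embeddings of channel descriptions

z_ctx = gate_and_pool(encode_channels(context, descs, W_s, W_d), W_g)
z_tgt = gate_and_pool(encode_channels(target, descs, W_s, W_d), W_g)

# JEPA-style objective: predict the target's embedding in latent space,
# rather than reconstructing the raw (possibly noisy) sensor values.
pred = z_ctx @ W_pred
loss = float(np.mean((pred - z_tgt) ** 2))

# Channel-order invariance: permuting channels (with their descriptions)
# leaves the pooled embedding unchanged.
perm = rng.permutation(C)
z_perm = gate_and_pool(encode_channels(context[perm], descs[perm], W_s, W_d), W_g)
assert np.allclose(z_ctx, z_perm)
```

Predicting in latent space, as sketched here, is what allows the encoder to discard sensor-level noise: the loss never forces the model to reproduce raw measurements, only their abstract summary.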