ConsMSA: Semantic Distribution Consistency Learning for Multimodal Sentiment Analysis
Pan Wang ⋅ Lipeng Ke ⋅ Huajun Ying ⋅ Pritish Mohapatra ⋅ Rohan Sarkar ⋅ Suresh Lakhani ⋅ Sankar Venkataraman ⋅ Jingtong Hu
Abstract
Multimodal sentiment analysis (MSA) aims to predict human sentiment by integrating signals from modalities such as text, video, and audio. However, raw multimodal sequences often exhibit semantic inconsistencies, in the form of redundancy or conflicts within and across modalities, which hinder robust understanding and inflate computational cost. To address this, we introduce ConsMSA, which explicitly formalizes semantic distribution consistency at both the \textit{intra}- and \textit{inter}-modality levels, providing a principled mechanism for robust and efficient multimodal sentiment prediction. Specifically, ConsMSA projects multimodal token features into a shared sentiment space to compute an Intra- and Inter-modality Consistency Score ($I^2CS$). Coupling this score with predictive relevance yields importance signals that are used (i) as a consistency regularizer that aligns latent distributions during training, (ii) to derive semantic-aware weights for adaptive multimodal token reweighting, and (iii) as a criterion for pruning redundant or conflicting tokens. Extensive experiments on CMU-MOSI and CMU-MOSEI demonstrate that ConsMSA achieves state-of-the-art performance while remaining robust under aggressive token compression: retaining only 10\% of tokens yields comparable accuracy. These results establish semantic distribution consistency as a foundation for combining predictive robustness with computational efficiency.
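The abstract does not give the exact formulation of $I^2CS$, so the following is only a minimal illustrative sketch of the idea: project each modality's token features into a shared space and score each token by its agreement (here, cosine similarity, an assumption) with its own modality and with the other modalities. All names (`i2cs`, `proj_t`, `proj_v`, `proj_a`) are hypothetical, not the paper's API.

```python
import torch
import torch.nn.functional as F

def i2cs(text, video, audio, proj_t, proj_v, proj_a):
    """Illustrative Intra- and Inter-modality Consistency Score (I^2CS).

    Each input is a token-feature tensor of shape (batch, tokens, dim).
    proj_* map each modality into a shared sentiment space; consistency
    is measured with cosine similarity, which is an assumption -- the
    paper's exact definition is not stated in the abstract.
    """
    # Project into the shared sentiment space and L2-normalize tokens.
    zs = [F.normalize(p(x), dim=-1)
          for p, x in ((proj_t, text), (proj_v, video), (proj_a, audio))]

    # Intra-modality consistency: each token vs. its modality's mean direction.
    intra = [(z * F.normalize(z.mean(dim=1, keepdim=True), dim=-1)).sum(-1)
             for z in zs]  # each (batch, tokens)

    # Inter-modality consistency: each token vs. the mean of the other modalities.
    inter = []
    for i, z in enumerate(zs):
        others = torch.stack([zs[j].mean(dim=1) for j in range(3) if j != i], dim=1)
        ref = F.normalize(others.mean(dim=1, keepdim=True), dim=-1)  # (batch, 1, dim)
        inter.append((z * ref).sum(-1))  # (batch, tokens)

    # Combine into one per-token consistency signal in [-1, 1].
    return [0.5 * (a + b) for a, b in zip(intra, inter)]
```

Under this sketch, the three uses listed above follow naturally: the scores can enter a training-time regularizer, a softmax over them can reweight tokens, and a top-$k$ selection on them can drive pruning (e.g., keeping 10\% of tokens).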