World-Model Inspired Emotion-aware Token Refinement for Training-Free Multimodal Emotion Recognition
Abstract
Multimodal Large Language Models (MLLMs) show promise for Multimodal Emotion Recognition (MER) but often remain unreliable because sparse emotional cues can easily be overwhelmed by redundant context. While fine-tuning is effective, it is usually costly for large models. Training-free methods such as chain-of-thought reasoning provide a practical alternative, but they mostly rely on heuristic prompting to influence model behavior and do not explicitly attend to emotion-relevant tokens internally; as a result, decision-relevant emotional tokens can be diluted by environmental noise, yielding unstable predictions. To address this limitation without training, we rethink MER from a world-model perspective that treats emotion as a latent state inferred from noisy and redundant multimodal observations. Under frozen parameters, this view suggests that robustness depends on constraining why and how tokens contribute to inference. Based on this insight, we propose WETR (World-Model inspired Emotion-aware Token Refinement), a training-free, plug-and-play regulator that reshapes token usage through two mechanisms: Noise-suppressed Token Selection (NTS), which suppresses redundant intra-modal noise, and State-strengthened Token Reweighting (STR), which amplifies decision-relevant emotional tokens. Experiments on multiple MER benchmarks demonstrate that WETR consistently improves accuracy and stability under frozen parameters while also improving token-level interpretability.