VIBE: Disentangling Social Dynamics via Kinematics-Informed Variational Inference for Behavioral Emotion
Abstract
Group Emotion Recognition (GER) is crucial for understanding social dynamics, from interpreting intimate conversations to evaluating crowd behavior in large-scale surveillance scenarios. Current models, however, often act as black boxes that exploit shortcuts: rather than attending to how people actually behave, they are frequently misled by the background environment, which degrades recognition accuracy. To bridge this gap, we introduce VIBE (Variational Inference for Behavioral Emotion), a kinematics-aware framework that integrates audio, video, and text modalities through causal structuring. Unlike standard models that simply fuse modalities, VIBE applies mathematical constraints to filter out contextual noise and isolate the genuine emotional signals of the people involved. This purified representation enables the model to focus on the social mechanics of the group, dynamically modulating neural attention according to raw physical synchrony. Simultaneously, we align visual dynamics with human interpretability by projecting latent representations into a semantically structured space informed by textual descriptions. Comprehensive experiments demonstrate that VIBE consistently outperforms state-of-the-art methods. Code will be made publicly available upon acceptance.