EmWorld: Emotion World Model with Latent State Evolution for Scenario-Incremental Dynamic Facial Expression Recognition
Abstract
Dynamic Facial Expression Recognition (DFER) models the temporal evolution of facial expressions in videos. In real-world deployments, changing scenarios distort expression trajectories over time, making it difficult for existing methods to maintain performance. While most current approaches address this issue through passive feature alignment across scenarios or domain-incremental learning techniques that preserve previously learned representations, they do not explicitly model scenario evolution over time, limiting their ability to robustly capture expression dynamics under scenario-incremental changes. To this end, we propose EmWorld, an emotion world model for DFER that explicitly models latent emotion state evolution under scenario variations. Specifically, EmWorld formulates scenario-incremental DFER as a progressive Bayesian inference problem over latent world states at dual temporal scales. A slow-timescale component (STS) models scenario evolution using stochastic evolutionary priors, capturing long-term scenario effects and providing proactive guidance in new scenarios. A fast-timescale component (FTS) models frame-level expression dynamics with temporally consistent latent transitions, effectively decoupling expression dynamics from scenario influences. By jointly inferring latent states at both timescales, EmWorld shifts DFER from passive feature discrimination to active probabilistic state inference under evolving scenarios. Experiments on FERV39k, DFEW, and MAFW demonstrate that EmWorld consistently outperforms state-of-the-art methods, achieving up to a 3.84\% improvement while exhibiting strong cross-scenario stability and long-term robustness.
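The dual-timescale structure described above can be illustrated with a minimal sketch. This is not the paper's implementation: the linear transitions, dimensions, update ratio, and function names (`sts_step`, `fts_step`) are all hypothetical, chosen only to show how a slowly evolving scenario state can condition a fast-evolving expression state.

```python
import numpy as np

rng = np.random.default_rng(0)

def sts_step(s, drift=0.01, noise=0.05):
    # Slow-timescale scenario state with a stochastic evolutionary prior
    # (hypothetical random-walk form, not the paper's actual prior).
    return s + drift + noise * rng.standard_normal(s.shape)

def fts_step(z, s, A, B):
    # Fast-timescale expression state: a latent transition conditioned
    # on the current scenario state (hypothetical linear form).
    return A @ z + B @ s

# Toy dimensions, purely illustrative.
d_s, d_z, n_scenarios, n_frames = 4, 8, 5, 30
A = 0.9 * np.eye(d_z)                       # temporally consistent transition
B = 0.1 * rng.standard_normal((d_z, d_s))   # scenario-to-expression coupling

s = np.zeros(d_s)
z = np.zeros(d_z)
for t in range(n_frames):
    if t % (n_frames // n_scenarios) == 0:  # scenario state evolves less often
        s = sts_step(s)
    z = fts_step(z, s, A, B)
```

In a full probabilistic treatment, `sts_step` and `fts_step` would define prior transition distributions, and both latent states would be inferred jointly from observed frames rather than simulated forward as here.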