EARL: Towards a Unified Analysis-Guided Reinforcement Learning Framework for Egocentric Interaction Reasoning and Pixel Grounding
Abstract
A precise and comprehensive understanding of human-environment interactions in egocentric vision is essential for next-generation intelligent agents, such as assistive robots. While existing multimodal large language models (MLLMs) support unified reasoning from scene-level analysis to instance-specific grounding, their accuracy and generalization remain limited. To address this, we introduce EARL, a novel Egocentric Analysis-guided Reinforcement Learning method that employs Group Relative Policy Optimization (GRPO) to enhance the interaction understanding of MLLMs in first-person vision. Specifically, EARL adopts a two-stage parsing framework comprising coarse-grained interpretation and fine-grained response. The first stage holistically interprets egocentric interactions and generates a structured textual description. The second stage produces the language answer and the corresponding pixel-level grounding mask in response to the user query. To bridge the two stages, we extract a global interaction descriptor from the first stage and treat it as a semantic prior, which is then integrated via a novel Analysis-guided Feature Synthesizer (AFS) to support query-oriented reasoning. Furthermore, to effectively guide policy optimization, we design a multi-faceted reward mechanism that incorporates format correctness, answer relevance, and grounding accuracy. Experimental results demonstrate that EARL achieves 65.48% cIoU on the Ego-IRGBench benchmark for pixel grounding, surpassing previous state-of-the-art RL-based methods by 8.37%. Superior performance in out-of-distribution evaluations further validates EARL's generalization capability.
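To make the reward design concrete, the sketch below illustrates one plausible instantiation of a multi-faceted reward (format correctness, answer relevance, grounding accuracy) together with GRPO-style group-relative advantage normalization. The response template, the reward weights, and the use of per-sample mask IoU as a proxy for grounding accuracy are all illustrative assumptions, not the paper's exact formulation.

```python
import re
import numpy as np

def format_reward(response: str) -> float:
    """1.0 if the response follows a <think>...</think><answer>...</answer>
    template (a hypothetical format), else 0.0."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, response.strip(), flags=re.DOTALL) else 0.0

def grounding_reward(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Mask IoU between predicted and ground-truth binary masks
    (a per-sample stand-in for the cumulative-IoU metric)."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(inter / union) if union > 0 else 0.0

def total_reward(response, pred_mask, gt_mask, answer_score,
                 w=(0.2, 0.4, 0.4)):
    """Weighted sum of the three reward terms; `answer_score` in [0, 1]
    would come from an external answer-relevance scorer, and the
    weights `w` are placeholders."""
    return (w[0] * format_reward(response)
            + w[1] * answer_score
            + w[2] * grounding_reward(pred_mask, gt_mask))

def group_relative_advantages(rewards):
    """GRPO-style advantages: standardize rewards across the group of
    responses sampled for the same query (no learned value function)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)
```

In GRPO, each query yields a group of sampled responses; each response's scalar reward is normalized against the group mean and standard deviation, so a response is reinforced only to the extent that it outperforms its siblings.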