Ego3S: Select, Strengthen, and Synchronize for Efficient Egocentric Reasoning
Abstract
Egocentric reasoning fundamentally differs from third-person understanding in LVLMs. Third-person settings offer wide, stable contexts with consistent global regularities, allowing models to exploit broad statistical correlations. In contrast, egocentric scenes are highly dynamic and heterogeneous, and the decisive cues are localized and atypical. Robust egocentric reasoning therefore requires models to focus on "what is seen now", i.e., the immediate visual input. However, existing methods tend to exhibit "inertial thinking", relying excessively on language priors and global context. To address this limitation, we propose Ego3S, a novel three-stage framework that grounds the model's reasoning in interaction evidence. Specifically, before training, we first use a counterfactual-based paradigm to select high-value samples that effectively activate multimodal reasoning, mitigating over-reliance on language priors and global context. Second, we introduce an interaction-centric reward for reinforcement learning that strengthens the model's sensitivity to localized interaction cues. Finally, during training, we employ a variance-aware learning schedule that monitors reward distributions to dynamically synchronize data selection with the model's evolving competence. Experiments on five datasets show that Ego3S consistently achieves superior performance using only 26.5% of the training data while reducing computational cost by over 46%. Code is available at https://anonymous.4open.science/r/Ego3S-70A2.
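To make the "variance-aware learning schedule" concrete, the sketch below shows one plausible reading of the idea: rank candidate training samples by the variance of their rollout rewards and keep only the high-variance (i.e., still-informative) fraction, so data selection tracks the model's current competence. The abstract does not specify the implementation, so the function name, the keep ratio, and the variance threshold here are illustrative assumptions, not the authors' actual procedure.

```python
import numpy as np

def select_by_reward_variance(rewards_per_sample, keep_ratio=0.25, min_var=1e-4):
    """Hypothetical variance-aware selection (not the paper's exact method).

    rewards_per_sample: list of 1-D arrays, one array of rollout rewards per sample.
    keep_ratio:         fraction of samples to retain (assumed value).
    min_var:            samples with near-zero reward variance are treated as
                        already mastered (or hopeless) and dropped.
    """
    variances = np.array([np.var(r) for r in rewards_per_sample])
    # Drop samples the current policy answers (almost) deterministically.
    candidates = np.where(variances > min_var)[0]
    # Keep the highest-variance samples, where the RL learning signal is largest.
    k = max(1, int(keep_ratio * len(rewards_per_sample)))
    keep = candidates[np.argsort(variances[candidates])[::-1][:k]]
    return sorted(keep.tolist())

# Example: 4 candidate samples, 3 reward rollouts each.
rollouts = [np.array([1.0, 1.0, 1.0]),   # always solved -> dropped
            np.array([0.0, 1.0, 0.5]),
            np.array([0.0, 0.0, 0.0]),   # never solved -> dropped
            np.array([0.2, 0.9, 0.4])]
print(select_by_reward_variance(rollouts, keep_ratio=0.5))  # -> [1, 3]
```

Re-running such a selection as training progresses would naturally synchronize the retained data with the evolving policy, which is one way the monitoring of reward distributions described in the abstract could be realized.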