Seeing is Solving: Unlocking Efficient Multimodal RL via View Alignment
Abstract
Although Reinforcement Learning Fine-Tuning (RLFT) applied to Vision-Language Models (VLMs) substantially enhances multimodal reasoning capabilities, its prohibitive training cost limits broad adoption. Surprisingly, most existing methods simply port Large Language Model (LLM) RLFT techniques to VLMs, ignoring an intrinsic property of multimodal models: their dynamic text–vision alignment. We ask a new question: Can this intrinsic alignment be turned into a training signal that makes VLM RLFT more efficient? We analyze how a VLM plans to attend, actually attends, and ideally should attend during reasoning, and derive two lightweight metrics from these patterns. Predictive View Accuracy (PVA) estimates sample difficulty, and Reasoning View Accuracy (RVA) reflects the quality of chain-of-thought (CoT) reasoning. These alignment signals enable an automated data curriculum and dense reasoning supervision. We introduce FOCUS-RL, a plug-and-play framework that can be seamlessly integrated into any VLM and dramatically boosts RLFT training efficiency. FOCUS-RL converges 2.5x–4x faster than vanilla GRPO and yields consistent accuracy gains (+4.4 points on average) across six benchmarks and multiple VLM families.
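The abstract does not define PVA or RVA concretely. As a minimal sketch of the general idea, not the paper's actual formulation, an attention-alignment score could measure how much of a model's attention mass falls inside a ground-truth relevant image region; the function name, shapes, and scoring rule below are all assumptions for illustration.

```python
# Hypothetical sketch: score how well an attention map aligns with a
# ground-truth image region. PVA/RVA as defined in the paper may differ;
# this only illustrates an attention-alignment signal of the kind described.
import numpy as np

def view_accuracy(attn: np.ndarray, region_mask: np.ndarray) -> float:
    """Fraction of attention mass inside the target region.

    attn        -- non-negative attention map over image patches, shape (H, W)
    region_mask -- boolean mask of the ground-truth relevant region, shape (H, W)
    """
    total = attn.sum()
    if total == 0:
        return 0.0
    return float(attn[region_mask].sum() / total)

# Toy example: a 4x4 patch grid with the target region in the top-left quadrant.
attn = np.random.rand(4, 4)
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True
print(f"alignment score: {view_accuracy(attn, mask):.3f}")
```

Under this reading, PVA would apply such a score to the attention map the model produces before reasoning (where it plans to attend), and RVA to the attention observed during CoT generation (where it actually attends), though the exact computation is specified in the paper itself.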