Learn Where Outcomes Diverge: Efficient VLA RL via Probabilistic Chunk Masking
Abstract
Vision-language-action (VLA) policies are post-trained with reinforcement learning (RL) that uses binary task outcomes as world feedback, but this stage is computationally expensive. A natural response has been to speed up rollout data collection through faster simulators and world models. In GRPO-based VLA RL, however, we find that the dominant cost lies elsewhere: gradient computation accounts for ~78% of wall-clock time per step in our runs, while rollout collection accounts for only ~21%. Moreover, much of this gradient computation is spent on phases that contribute little to learning. GRPO's learning signal is driven by advantage variance, so only phases where successful and failed rollouts diverge produce a useful signal. Yet GRPO assigns the same advantage to every chunk in a rollout. As a result, actor-update compute is spread uniformly across the trajectory, including phases the policy already handles after pre-training and supervised fine-tuning. This paper presents Probabilistic Chunk Masking (PCM), a drop-in modification to GRPO that allocates gradient computation to a small, probabilistically selected subset of chunks per trajectory. PCM scores semantic phases using success-failure action variance, a rollout-derived proxy for per-phase gradient variance, and samples a fixed chunk budget using phase-level keep probabilities that are updated online. We formalize per-phase gradient variance as the quantity that determines where gradient computation is useful and show that success-failure action variance provides a measurable proxy for it. PCM requires no reward model or learned critic; it exploits the structure of binary world-feedback signals directly. On three LIBERO benchmarks, PCM matches the final success rate of standard GRPO while achieving a 2.38x wall-clock speedup, 4.8x faster gradient updates, and 60% lower peak activation memory, backpropagating through fewer than 20% of trajectory chunks.
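To make the selection step concrete, the following is a minimal sketch of how phase scoring, online keep-probability updates, and fixed-budget chunk sampling could fit together. It is an illustration under stated assumptions, not the paper's implementation: the function names, the scoring formula (squared gap between mean successful and mean failed actions), the moving-average update, the `momentum` and `floor` parameters, and the assumption that all rollouts share one chunk-to-phase segmentation are all hypothetical.

```python
import numpy as np

def phase_scores(actions, phases, success, num_phases):
    """Score each phase by how far successful and failed rollouts diverge.

    actions: (R, C, D) array, one D-dim action summary per chunk per rollout.
    phases:  (C,) int array, phase id of each chunk (assumed shared by rollouts).
    success: (R,) bool array of binary task outcomes.
    The squared mean-action gap used here is an assumed stand-in for the
    success-failure action variance described in the abstract.
    """
    scores = np.zeros(num_phases)
    if success.all() or (~success).all():
        return scores  # no outcome divergence in this group, nothing to score
    mu_s = actions[success].mean(axis=0)       # (C, D) mean action of successes
    mu_f = actions[~success].mean(axis=0)      # (C, D) mean action of failures
    gap = ((mu_s - mu_f) ** 2).sum(axis=-1)    # (C,) per-chunk divergence
    for p in range(num_phases):
        in_phase = phases == p
        if in_phase.any():
            scores[p] = gap[in_phase].mean()
    return scores

def update_keep_probs(keep_probs, scores, momentum=0.9, floor=0.05):
    """Moving-average update of phase-level keep probabilities (assumed rule)."""
    total = scores.sum()
    if total > 0:
        target = scores / total
    else:
        target = np.full(len(scores), 1.0 / len(scores))
    new = momentum * keep_probs + (1.0 - momentum) * target
    new = np.maximum(new, floor)   # keep every phase sampleable
    return new / new.sum()

def sample_chunk_mask(phases, keep_probs, budget, rng):
    """Sample exactly `budget` chunks, weighted by their phase's keep probability."""
    weights = keep_probs[phases]
    weights = weights / weights.sum()
    idx = rng.choice(len(phases), size=budget, replace=False, p=weights)
    mask = np.zeros(len(phases), dtype=bool)
    mask[idx] = True
    return mask
```

In a training loop built on this sketch, only chunks where the mask is `True` would be passed through the backward pass, while the group-level GRPO advantage itself stays unchanged. The probability floor is a design choice of the sketch: it keeps low-scoring phases occasionally sampled so their scores can recover if the policy later regresses there.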