Poster Mon, Jul 6, 2026 • 6:30 PM – 8:15 PM PDT HALL A #1708

Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs

Luke Huang ⋅ Zhuoyang Zhang ⋅ Qinghao Hu ⋅ Shang Yang ⋅ Song Han

Project Page

Abstract

Asynchronous reinforcement learning has become increasingly central to scaling LLM post-training, delivering major throughput gains by decoupling rollout generation from policy updates. However, widely used policy-gradient objectives such as REINFORCE and GRPO suffer under high asynchrony: stale rollouts produce heavy-tailed importance weights, so a small number of trajectories dominate updates and the policy-gradient estimator becomes markedly higher variance. Through systematic analysis on math, reasoning, and tool-use benchmarks, we establish that this increasing variance is reliably predicted by collapsing effective sample size (ESS), which prior stabilization methods largely fail to address. Motivated by this diagnosis, we introduce Variance Controlled Policy Optimization (VCPO), a method that (i) dynamically scales the learning rate with ESS to dampen unreliable updates and (ii) applies a closed-form minimum-variance baseline for off-policy settings, without a critic model and adding minimal overhead. Empirically, across math and general reasoning benchmarks, this enables robustly stable asynchronous training compared to previous stabilization and algorithmic methods, even in highly off-policy regimes (128 steps off-policy). In a long-horizon, tool-use task, VCPO matches synchronous performance while delivering a 2.5× speedup in training time. Code is available at: https://github.com/mit-han-lab/vcpo

Lay Summary

Async RL is now standard for most large-scale post-training runs, offering 2–3x speedups over synchronous RL. Clipping and masking importance weights can stabilize training at low policy lag. But at high policy lag, which unlocks further speedups at longer contexts, none of them are robust. We found two root causes: - ESS (Effective Sample Size, how many of your training samples contribute to your gradien) silently collapses, and is a reliable leading indicator of training collapse. - A few trajectories dominate the gradient update, leading to high-variance gradient estimates. Introducing Variance Controlled Policy Optimization (VCPO): 1. ESS-Adaptive LR: Take smaller steps when your batch becomes noisier and lower effective sample size. 2. OPOB: A closed-form minimum-variance baseline, exact and not approximate, with only 19% compute overhead via a single backward pass. Result: Synchronous-level accuracy with 2.5x the wall-clock throughput.