Rate or Fate? RLV$^{\varepsilon}$R: Reinforcement Learning with Verifiable Noisy Rewards
Ali Rad ⋅ Khashayar Filom ⋅ Darioush Keivan ⋅ Peyman Mohajerin Esfahani ⋅ Ehsan Kamalinejad
Abstract
Reinforcement learning with verifiable rewards (RLVR) trains a policy by verifying sampled completions and reinforcing higher-scoring outputs, but practical verifiers (e.g., incomplete unit tests or noisy judges) are prone to false positives and false negatives. We ask when such noise merely slows learning and when it reverses it. Modeling GRPO-style RLVR as a bandit over recurring \emph{reasoning modes}, we derive a mean-field replicator-style (natural-selection) flow on the probability simplex. The dynamics decouples into within-correct-mode competition and a one-dimensional evolution for the mass on incorrect modes, whose drift is determined solely by Youden's index $J=\mathrm{TPR}-\mathrm{FPR}$, the verifier's true-positive rate minus its false-positive rate. This yields a sharp phase transition: when $J>0$, the incorrect mass is driven toward extinction (learning); when $J=0$, the process is neutral; and when $J<0$, incorrect modes amplify until they dominate (anti-learning and collapse). In the learning regime $J>0$, noise primarily rescales convergence time (``rate, not fate''). Experiments on verifiable programming tasks under synthetic noise reproduce the predicted $J=0$ boundary. Beyond noise, the framework offers a general lens for analyzing RLVR stability, convergence, and algorithmic interventions.
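To make the $J$-governed drift concrete, here is a minimal sketch of the reduced one-dimensional dynamics, under the assumption, consistent with the setup above, that each correct mode earns expected verifier reward $\mathrm{TPR}$ and each incorrect mode earns $\mathrm{FPR}$; the notation $p_i$ (mode probabilities), $\mathcal{I}$ (incorrect modes), $q$ (incorrect mass), and $\bar f$ (mean fitness) is introduced here for illustration:
\begin{align*}
\dot p_i &= p_i\,(f_i - \bar f), \qquad
f_i = \begin{cases} \mathrm{TPR}, & i \text{ correct},\\ \mathrm{FPR}, & i \in \mathcal{I}, \end{cases} \qquad
\bar f = \mathrm{TPR}\,(1-q) + \mathrm{FPR}\,q, \\
\dot q &= \sum_{i \in \mathcal{I}} p_i\,(\mathrm{FPR} - \bar f)
       = q\,(1-q)\,(\mathrm{FPR} - \mathrm{TPR})
       = -J\,q\,(1-q).
\end{align*}
The incorrect mass $q$ thus follows logistic dynamics: $J>0$ drives $q \to 0$ at a rate set by $|J|$ (noise rescales time, not the endpoint), $J=0$ leaves $q$ stationary, and $J<0$ drives $q \to 1$.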