Retroactive Advantage Correction: Closed-Form V-Trace Bias Correction for Delay-Aware RLHF
Arnav Raj
Abstract
Reinforcement learning from human feedback (RLHF) in production does not always have a synchronous reward signal. Code-execution verifiers, slow judge ensembles, and queued human review can return rewards several gradient steps after the rollout that produced them, breaking the synchronous-reward assumption underlying standard PPO. We address this gap with Retroactive Advantage Correction (RAC): each pending slow completion is queued, aged through a non-negative kernel, and reinjected as a clipped residual into the advantage of the next optimiser step. We prove that the cumulative RAC correction has bias linear in the row-stochasticity slack of the effective delay kernel, and that it is exact (zero bias in expectation) at the saturated kernel. At the identity kernel, the result reduces to the on-policy guarantee of V-trace, the canonical clipped importance-sampling corrector for off-policy actor-critic targets. A complementary total-variation bound combines Pinsker's inequality with the Bretagnolle-Huber lemma to control per-step policy divergence. On a tabular Markov decision process (MDP) proof of concept, RAC reduces the closed-form policy bias by up to $47.9\times$ at the two-slow-channel deployment configuration, a larger bias reduction than wait-for-slow at lower wall-clock cost. Three $7$B-scale checks confirm that the underlying algebraic identity holds to machine precision on real reward distributions. RAC integrates with PPO and GRPO at the reward-manager interface through a two-line patch.
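As a minimal illustration of the two inequalities named above, the sketch below evaluates Pinsker's bound, $\mathrm{TV} \le \sqrt{\mathrm{KL}/2}$, and the Bretagnolle-Huber bound, $\mathrm{TV} \le \sqrt{1 - e^{-\mathrm{KL}}}$, and takes their minimum. Taking the minimum is an assumption about how the two might be combined, not the exact form of the bound proved in the paper; the function names and the two example distributions are hypothetical.

```python
import math

def pinsker_bound(kl: float) -> float:
    """Pinsker's inequality: TV(P, Q) <= sqrt(KL(P || Q) / 2)."""
    return math.sqrt(kl / 2.0)

def bretagnolle_huber_bound(kl: float) -> float:
    """Bretagnolle-Huber inequality: TV(P, Q) <= sqrt(1 - exp(-KL(P || Q)))."""
    return math.sqrt(1.0 - math.exp(-kl))

def tv_upper_bound(kl: float) -> float:
    """Tighter of the two bounds (assumed combination: the minimum)."""
    if kl < 0.0:
        raise ValueError("KL divergence must be non-negative")
    return min(pinsker_bound(kl), bretagnolle_huber_bound(kl))

def kl_divergence(p, q):
    """KL(P || Q) in nats for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def total_variation(p, q):
    """TV(P, Q) = 0.5 * sum_i |p_i - q_i| for discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Hypothetical old and new action distributions at a single state.
old_policy = [0.5, 0.5]
new_policy = [0.9, 0.1]

kl = kl_divergence(old_policy, new_policy)
tv = total_variation(old_policy, new_policy)
bound = tv_upper_bound(kl)
print(f"KL = {kl:.4f}, TV = {tv:.4f}, bound = {bound:.4f}")
```

Pinsker's bound is the tighter of the two for small KL but exceeds 1 once KL is larger than 2, whereas the Bretagnolle-Huber bound always stays below 1, which is why the two are complementary.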