Poster in Workshop: RLxF: RL from World Feedback
Fri, Jul 10, 2026 • 12:00 AM – 1:00 AM PDT

When Is World Feedback Transferable? A Convergence Gate in Contrastive Reinforcement Learning

Bruce C Xu ⋅ Jay J Park ⋅ Vivek Buch

Abstract

World feedback, meaning measurable signals from agent-environment interaction such as goal reaching, future-state prediction, or system metrics, offers a path to scalable reinforcement learning that does not depend on human annotation. A standard recipe for deploying world-feedback agents is to amplify a strong teacher into smaller students via distillation. We show this recipe silently fails when the world-feedback signal is contrastive. In contrastive RL (CRL), where the teacher learns from an InfoNCE objective over future states, student performance is nearly constant across a wide range of partially trained teachers and then increases sharply once the teacher crosses a convergence gate: a sharp threshold in world-feedback representation quality. The gate is robust across three independent teachers, predictable from a zero-cost behavioral diagnostic ($\rho > 0.7$, $\mathrm{AUC} = 0.895$), and causally controlled by contrastive discrimination difficulty: varying the InfoNCE temperature shifts the gate by 13 epochs. Once the teacher crosses the gate, amplification is dramatic: on Humanoid, distillation transforms a near-zero baseline ($6.4$) into a walking policy at $92\%$ of teacher performance. By contrast, when the world-feedback signal is a scalar reward, SAC distillation is smooth and gate-free, transferring an $11\times$ smaller student to $96.6\%$ of teacher performance. The form of world feedback (dense contrastive versus scalar) can therefore determine whether amplification is smooth or gated. We close with operational guidance for world-feedback pipelines: cheap behavioral statistics on the teacher, such as mean action magnitude and output entropy, are stronger checkpoint-selection signals than validation loss, and contrastive temperature should be treated as an explicit knob over the gate's location. These results suggest that forms of world feedback are not interchangeable: the form governs whether a trained agent is a teacher one can reliably build on.
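
To make the role of the InfoNCE temperature concrete, here is a minimal sketch of a batch InfoNCE loss with an explicit temperature. It is an illustration only, not the authors' implementation: the dot-product similarity, the function name `info_nce_loss`, and the argument names are assumptions, since the abstract does not specify the critic parameterization.

```python
import math


def info_nce_loss(state_action_embs, future_state_embs, temperature=1.0):
    """Mean InfoNCE loss over a batch (hypothetical helper, for illustration).

    Row i of state_action_embs is treated as the anchor, row i of
    future_state_embs as its positive, and every other row as a negative.
    Similarity is a plain dot product, which is an assumption of this sketch.
    """
    n = len(state_action_embs)
    total = 0.0
    for i in range(n):
        # Logits of anchor i against every candidate future state,
        # scaled by the temperature.
        logits = [
            sum(a * b for a, b in zip(state_action_embs[i], future_state_embs[j]))
            / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over the candidates.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching future state as the correct class.
        total += log_sum_exp - logits[i]
    return total / n


# Two orthogonal unit embeddings: each anchor matches its own future state.
embs = [[1.0, 0.0], [0.0, 1.0]]
loss_warm = info_nce_loss(embs, embs, temperature=1.0)
loss_cold = info_nce_loss(embs, embs, temperature=0.1)
```

In this toy batch, lowering the temperature sharpens the softmax over candidates, so the same embeddings yield a smaller loss. Temperature changes how hard the discrimination task is, which is the sense in which it can act as a knob; the sketch does not model how the gate location responds.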