Learning from World Feedback: Why Model Uncertainty Fails as a Risk Signal in Model-Based RL
Abstract
The RLxF programme argues that learning signals should come from world feedback rather than from internal model proxies. We instantiate this position in safe model-based control and distill it into three concrete design principles. Empirically, across four world-model architectures whose prediction MSE spans a 2× range, MPC planning performance is statistically equivalent (TOST, n = 200), and dynamics-based uncertainty penalties increase the collision rate from 26% to 34%: the standard MBRL safety proxy is anti-correlated with safety in this regime. Replacing the model-internal proxy with three world-feedback signals reduces the collision rate to 1–14% without retraining the world model or the planner. The three signals are a sensor-derived margin (minimum lidar range), a temporal signal (time-to-collision), and an outcome-supervised feedback model gψ trained on prior collision labels, which is structurally analogous to outcome-trained reward models in RLHF. The mechanism is structural: model uncertainty has support over state-prediction space, whereas task risk has support over constraint boundaries, and the two are only weakly correlated empirically (r < 0.15). From this, we extract three RLxF principles (ground risk in world outcomes, validate proxies before deployment, and substitute outcome-trained feedback models when direct world signals are unavailable) and argue that they apply equally to model-based control and to verifier-based or RLHF approaches in LLM alignment.
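To make the two directly sensed signals concrete, the following is a minimal Python sketch of how a minimum-lidar margin and a time-to-collision estimate might be turned into a planning penalty, together with a Pearson correlation check of the kind that could be used to validate a proxy against observed outcomes. It is an illustration under stated assumptions, not the implementation evaluated here: the function names, the 0.2 m safety radius, the exponential penalty shape, and the two scale parameters are all hypothetical.

```python
import math
from typing import Sequence


def min_lidar_margin(ranges: Sequence[float], safety_radius: float = 0.2) -> float:
    """Sensor-derived margin: nearest lidar return minus an assumed safety radius (m)."""
    return min(ranges) - safety_radius


def time_to_collision(ranges: Sequence[float],
                      closing_speeds: Sequence[float],
                      eps: float = 1e-6) -> float:
    """Temporal signal: smallest range / closing-speed ratio over beams that are
    closing (speed > eps). Returns inf when nothing is approaching."""
    ttcs = [r / v for r, v in zip(ranges, closing_speeds) if v > eps]
    return min(ttcs) if ttcs else math.inf


def risk_penalty(ranges: Sequence[float],
                 closing_speeds: Sequence[float],
                 margin_scale: float = 0.5,
                 ttc_scale: float = 2.0) -> float:
    """Hypothetical penalty in [0, 2]: larger when the margin is small or
    the time-to-collision is short. The exponential shape is an assumption."""
    margin = min_lidar_margin(ranges)
    ttc = time_to_collision(ranges, closing_speeds)
    margin_term = math.exp(-max(margin, 0.0) / margin_scale)
    ttc_term = math.exp(-ttc / ttc_scale)  # exp(-inf) evaluates to 0.0
    return margin_term + ttc_term


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation, e.g. between a candidate risk proxy and collision labels."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return cov / (sx * sy)


if __name__ == "__main__":
    # One obstacle 0.3 m away and closing at 0.5 m/s; a second beam is static.
    print(risk_penalty([0.3, 1.0], [0.5, 0.0]))
    # Open space with nothing approaching.
    print(risk_penalty([3.0, 4.0], [0.0, 0.0]))
```

In an MPC loop, a penalty of this kind would be added to the cost of each candidate trajectory step in place of a dynamics-uncertainty term. Running `pearson_r` on logged proxy values against collision outcomes before deployment is one simple way to apply the "validate proxies" principle.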