Poster in Workshop: RLxF: RL from World Feedback
Fri, Jul 10, 2026 • 12:00 AM – 1:00 AM PDT

Learning from World Feedback: Why Model Uncertainty Fails as a Risk Signal in Model-Based RL

Zhaohui Wang

Project Page

Abstract

The RLxF programme argues that learning signals should come from world feedback rather than from internal model proxies. We instantiate this position in safe model-based control and distill it into three concrete design principles. Empirically, MPC planning performance is statistically equivalent (TOST, n = 200) across four world-model architectures whose prediction MSE spans a 2× range, and dynamics-based uncertainty penalties increase collision rates from 26% to 34%: the standard MBRL safety proxy is anti-correlated with safety in this regime. Replacing the model-internal proxy with three world-feedback signals (a sensor-derived margin via minimum lidar range, a temporal signal via time-to-collision, and an outcome-supervised feedback model gψ trained on prior collision labels, structurally analogous to outcome-trained reward models in RLHF) reduces collisions to 1–14% without retraining the world model or the planner. The mechanism is structural: model uncertainty has support over state-prediction space, whereas task risk has support over constraint boundaries, and the empirical correlation between the two is r < 0.15. From this, we extract three RLxF principles (ground risk in world outcomes, validate proxies before deployment, and substitute outcome-trained feedback models when direct world signals are unavailable) and argue that they apply equally to model-based control and to verifier-based or RLHF approaches in LLM alignment.
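To make the two directly sensed signals concrete, the following is a minimal illustrative sketch of how a minimum-lidar margin and a time-to-collision term could be turned into a risk penalty on an MPC candidate's cost. It is not the authors' implementation: the function names, the `safe_dist` and `horizon` thresholds, the constant-closing-speed assumption, and the penalty `weight` are all hypothetical choices made for illustration.

```python
import numpy as np


def lidar_margin_risk(lidar_ranges, safe_dist=0.5):
    """Risk in [0, 1] from the minimum lidar range.

    0 when the nearest return is at least `safe_dist` away, 1 at contact.
    `safe_dist` is an assumed threshold, not a value from the abstract.
    """
    d_min = float(np.min(lidar_ranges))
    return float(np.clip(1.0 - d_min / safe_dist, 0.0, 1.0))


def time_to_collision(distance, closing_speed, eps=1e-6):
    """Time to collision assuming a constant closing speed.

    Returns infinity when the agent is not approaching the obstacle.
    """
    if closing_speed <= eps:
        return float("inf")
    return distance / closing_speed


def ttc_risk(distance, closing_speed, horizon=2.0):
    """Risk in [0, 1]: 0 when TTC is at least `horizon` seconds, 1 at TTC = 0.

    `horizon` is an assumed threshold chosen for illustration.
    """
    ttc = time_to_collision(distance, closing_speed)
    if np.isinf(ttc):
        return 0.0
    return float(np.clip(1.0 - ttc / horizon, 0.0, 1.0))


def penalized_cost(task_cost, risk, weight=10.0):
    """Add a world-feedback risk penalty to a candidate trajectory's cost."""
    return task_cost + weight * risk


if __name__ == "__main__":
    ranges = np.array([2.0, 1.0, 0.25])
    risk = max(lidar_margin_risk(ranges), ttc_risk(1.0, 1.0))
    print(penalized_cost(1.0, risk))
```

In a sketch like this, the penalty depends only on quantities measured from the world (ranges and closing speed), so it can be added to a planner's cost without touching the learned dynamics model, which matches the abstract's claim that neither the world model nor the planner is retrained.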