Reward, or the Observation Stream? Auditing World Model Quality in RL on a Turbofan Substrate
Abstract
Learning agents in partially observable environments need internal models of the world to act well and to support downstream tasks. What signal should shape those models? One view holds that maximising reward in a loop of action and observation is sufficient: a good policy implies a good internal model as a byproduct. An opposing view holds that reward shapes representations toward a minimal policy-sufficient statistic and discards the rest; faithful world models must instead emerge from self-supervised prediction on the observation stream. We arbitrate between the two on TurboSens, a turbofan engine simulator in which the true latent state is exposed as ground truth, so an encoder's learned representation can be probed directly against the state it should have recovered. The latent has two parts: an underlying wear state that drifts slowly, and a fouling layer (dust and deposits on engine components) that masks the wear and is periodically cleared by the agent's own maintenance actions. Together, the two drive the noisy sensor observations that TurboSens emits across multiple operating contexts, which are what the model observes. We find that the two objectives shape the encoder along orthogonal axes. PPO tracks the fouling, because the policy must react to it, and discards the underlying wear. Self-supervised pretraining (JEPA) does the opposite: it recovers the wear better but is invariant to fouling. Both build internal models of the engine, but of different parts. We show that the two compose cleanly when applied sequentially rather than jointly: pretrain the encoder once with JEPA, freeze it, then train PPO on top for each task. This recipe matches end-to-end PPO on the default reward, improves wear recovery, and yields a reusable encoder. We discuss implications for representation learning in safety-critical domains.
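To make "probed directly against the state it should have recovered" concrete, the following is a minimal sketch of one common way to run such a probe: fit a ridge-regularised linear map from frozen encoder features to a ground-truth latent and report held-out R². The abstract does not say which probe is used, so the linear form, the function name `probe_r2`, and its parameters are assumptions made for illustration. The demonstration data are synthetic stand-ins generated in the snippet, not TurboSens outputs and not results from this work.

```python
import numpy as np

def probe_r2(features, target, ridge=1e-3, train_frac=0.8, seed=0):
    """Fit a ridge-regularised linear probe; return R^2 on a held-out split.

    Hypothetical helper: `features` is an (n, d) array of frozen encoder
    outputs, `target` an (n,) array holding one ground-truth latent.
    """
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    perm = rng.permutation(n)
    n_train = int(train_frac * n)
    tr, te = perm[:n_train], perm[n_train:]

    # Standardise with training statistics only, then append a bias column.
    mu = features[tr].mean(axis=0)
    sd = features[tr].std(axis=0) + 1e-8

    def design(idx):
        x = (features[idx] - mu) / sd
        return np.hstack([x, np.ones((len(idx), 1))])

    x_tr, x_te = design(tr), design(te)

    # Closed-form ridge solution: w = (X^T X + lambda I)^{-1} X^T y.
    gram = x_tr.T @ x_tr + ridge * np.eye(x_tr.shape[1])
    w = np.linalg.solve(gram, x_tr.T @ target[tr])

    pred = x_te @ w
    ss_res = np.sum((target[te] - pred) ** 2)
    ss_tot = np.sum((target[te] - target[te].mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic illustration (assumed data): an encoder whose features carry
# the fouling latent plus noise, and contain no information about wear.
rng = np.random.default_rng(0)
n = 2000
wear = rng.normal(size=n)
fouling = rng.normal(size=n)
mix = rng.normal(size=(1, 8))
feats = fouling[:, None] * mix + 0.1 * rng.normal(size=(n, 8))

r2_fouling = probe_r2(feats, fouling)  # high: fouling is linearly decodable
r2_wear = probe_r2(feats, wear)        # near zero: wear is absent
print(f"fouling R^2 = {r2_fouling:.3f}, wear R^2 = {r2_wear:.3f}")
```

Running the same probe once per latent component on the same frozen features is what lets the two axes be compared separately. A linear probe measures only linearly decodable information, so a low score does not by itself rule out a nonlinear encoding.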