The Feedback-Operator Theorem: Identifying Welfare Orders in Reinforcement Learning from World Feedback
Manoj Saravanan
Abstract
We study reinforcement learning from world feedback through a consequence-space formulation in which each policy induces a consequence vector and heterogeneous feedback channels act on attainable consequence differences. Rather than asking whether a latent scalar reward is identifiable, we ask whether the welfare preorder on policies is identifiable. Our first result is a feedback-operator factorization theorem: exact identifiability holds if and only if the welfare label factors through the restricted feedback operator on the attainable difference set. In the full-dimensional polyhedral regime, this becomes a facet-observability theorem: exact identifiability is equivalent to observability of every order-separating facet normal of the welfare cone. Under Gaussian paired-difference feedback, each facet carries an intrinsic noise level $\sigma_j^2 = w_j^\top (A_V^*\Sigma^{-1}A_V)^\dagger w_j$, and the hardest observable facet governs the minimax difficulty of welfare-label recovery. We derive an exact two-point lower bound and a matching observable cone test, optimal up to the logarithmic factor required to control all facet inequalities simultaneously. Finally, we prove that exact order identifiability can hold even when no scalar Markov reward represents the same policy preorder. These results reframe world-feedback RL as an inverse problem over welfare orders rather than rewards.
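As a rough illustration of the facet noise formula above, the following sketch evaluates $\sigma_j^2 = w_j^\top (A_V^*\Sigma^{-1}A_V)^\dagger w_j$ numerically. It assumes a finite-dimensional real setting in which the restricted feedback operator $A_V$ is a matrix (so the adjoint $A_V^*$ is the transpose), $\Sigma$ is a positive-definite noise covariance, and the facet normals $w_j$ are stacked as rows of a matrix. The function name `facet_noise_levels` and the toy inputs are hypothetical and are not taken from the paper.

```python
import numpy as np


def facet_noise_levels(A_V, Sigma, W):
    """Compute sigma_j^2 = w_j^T (A_V^T Sigma^{-1} A_V)^+ w_j for each row w_j of W.

    Assumptions (not from the paper): A_V is a real (m x d) matrix,
    Sigma is an (m x m) positive-definite covariance, W is (k x d).
    """
    A_V = np.asarray(A_V, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    W = np.atleast_2d(np.asarray(W, dtype=float))

    # Information matrix A_V^T Sigma^{-1} A_V; solve() avoids forming Sigma^{-1}.
    info = A_V.T @ np.linalg.solve(Sigma, A_V)

    # Moore-Penrose pseudoinverse, matching the dagger in the formula.
    info_pinv = np.linalg.pinv(info)

    # Quadratic form w_j^T info_pinv w_j, evaluated row by row.
    return np.einsum("ji,ik,jk->j", W, info_pinv, W)


# Toy example: two feedback channels that each observe one coordinate,
# with noise variances 1 and 4, and two coordinate-aligned facet normals.
A_V = np.eye(2)
Sigma = np.diag([1.0, 4.0])
W = np.eye(2)

levels = facet_noise_levels(A_V, Sigma, W)
hardest = int(np.argmax(levels))  # index of the noisiest facet in this toy setup
```

In this toy setup the second facet is the noisier one, since it is seen only through the higher-variance channel. The sketch does not check whether a facet normal is observable; the pseudoinverse returns a finite value even for directions the feedback operator cannot see, so any observability test would need to be applied separately.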