Poster in Workshop: RLxF: RL from World Feedback
Fri, Jul 10, 2026 • 12:00 AM – 1:00 AM PDT

Coverage Cliffs in Learning from Logged World Feedback

Pauline Bourigault

Abstract

World feedback, including losses, costs, failures, latencies, health outcomes, or economic impacts measured from an environment, is a promising complement to human-preference feedback. Yet it is often logged under earlier policies, so its usefulness for a new policy is limited by off-policy coverage. We study this issue in logged contextual bandits. Instead of certifying mean loss, we ask when logged world feedback can certify upper-tail loss of a learned randomized policy. Our main result is a finite-sample PAC-Bayes certificate for posterior-averaged per-policy Value-at-Risk, obtained from posterior-uniform importance-weighted CDF control and quantile inversion. The analysis identifies a quantile-specific coverage cliff: if the logged feedback lacks enough weighted mass near the target quantile, the certified threshold must jump to a larger loss or become vacuous. Diagnostics show one theorem-covered continuous-outcome regime where the certificate is non-vacuous but conservative, and discrete or weak-coverage regimes where empirical improvements are not certifiable. The results highlight that logged world feedback is a coverage-limited statistical resource: it may support empirical learning without supporting a valid tail-safety certificate.
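The quantile-inversion step and the coverage cliff it produces can be illustrated with a minimal sketch. This is not the paper's PAC-Bayes certificate: it handles a single fixed target policy, and the `slack` parameter is a hypothetical stand-in for whatever confidence term a finite-sample bound would subtract from the importance-weighted CDF. The function names, the toy data, and the choice of an unnormalized importance-weighted estimator are all assumptions made for illustration.

```python
import numpy as np


def iw_cdf(losses, weights, t):
    """Importance-weighted estimate of P(loss <= t) under the target policy.

    weights[i] is assumed to be pi_target(a_i | x_i) / pi_logging(a_i | x_i).
    The estimate is not self-normalized, so it can fall short of 1 when the
    logged data covers the target policy poorly.
    """
    losses = np.asarray(losses, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.mean(weights * (losses <= t)))


def certified_var(losses, weights, alpha, slack):
    """Smallest logged loss t whose lower-bounded CDF reaches alpha.

    `slack` is a hypothetical confidence term (an assumption, not the
    paper's bound). Returns None when no logged loss qualifies, i.e. the
    certificate is vacuous.
    """
    losses = np.asarray(losses, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(losses)
    sorted_losses = losses[order]
    # Running importance-weighted CDF evaluated at each sorted loss value.
    cdf = np.cumsum(weights[order]) / len(losses)
    ok = np.nonzero(cdf - slack >= alpha)[0]
    return None if ok.size == 0 else float(sorted_losses[ok[0]])


# Toy logged data: little weighted mass sits near the upper tail.
losses = [0.1, 0.2, 0.3, 0.9]
weights = [1.5, 1.5, 0.5, 0.5]

for slack in (0.0, 0.1, 0.25):
    print(slack, certified_var(losses, weights, alpha=0.8, slack=slack))
```

In this toy data the threshold for `alpha=0.8` is 0.3 with no slack, jumps to 0.9 once the slack reaches 0.1, and becomes vacuous (`None`) at a slack of 0.25. The jump happens because only a small amount of weighted mass lies between the 0.3 and 0.9 loss values, which is the kind of quantile-specific coverage gap the abstract describes.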