Coverage Cliffs in Learning from Logged World Feedback
Abstract
World feedback, such as losses, costs, failures, latencies, health outcomes, and economic impacts measured from an environment, is a promising complement to human-preference feedback. Yet it is often logged under earlier policies, so its usefulness for evaluating a new policy is limited by off-policy coverage. We study this issue in logged contextual bandits. Instead of certifying mean loss, we ask when logged world feedback can certify the upper-tail loss of a learned randomized policy. Our main result is a finite-sample PAC-Bayes certificate for posterior-averaged per-policy Value-at-Risk, obtained from posterior-uniform importance-weighted CDF control and quantile inversion. The analysis identifies a quantile-specific coverage cliff: if the logged feedback lacks enough weighted mass near the target quantile, the certified threshold must jump to a larger loss or become vacuous. Diagnostics show one theorem-covered continuous-outcome regime where the certificate is non-vacuous but conservative, and discrete or weak-coverage regimes where empirical improvements are not certifiable. The results highlight that logged world feedback is a coverage-limited statistical resource: it may support empirical learning without supporting a valid tail-safety certificate.
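To make the coverage cliff concrete, the following is a minimal sketch of quantile inversion on an importance-weighted empirical CDF for a single fixed policy. It is an illustration under stated assumptions, not the paper's certificate: the function names, the toy data, and the constant `slack` (a stand-in for a confidence term such as a PAC-Bayes bound) are all hypothetical.

```python
import math

def weighted_cdf(losses, weights, t):
    """Importance-weighted estimate of P(loss <= t) under the target policy.

    weights[i] is assumed to be the ratio target/logging propensity for sample i.
    """
    n = len(losses)
    return sum(w for l, w in zip(losses, weights) if l <= t) / n

def certified_var(losses, weights, alpha, slack):
    """Smallest observed loss t with weighted_cdf(t) - slack >= alpha.

    `slack` is a hypothetical constant standing in for a confidence term.
    Returns math.inf (a vacuous threshold) if no observed loss qualifies.
    """
    n = len(losses)
    running = 0.0
    for l, w in sorted(zip(losses, weights)):
        running += w / n  # weighted mass accumulated up to loss l
        if running - slack >= alpha:
            return l
    return math.inf

# Toy logged losses (assumed data), target level alpha = 0.7, slack = 0.05.
losses = [0.1, 0.2, 0.3, 0.4, 1.0]
good = certified_var(losses, [1, 1, 1, 1, 1], 0.7, 0.05)        # mass near the quantile
weak = certified_var(losses, [1, 1, 1, 0.2, 1.8], 0.7, 0.05)    # little mass near it
none = certified_var(losses, [1, 1, 1, 0.2, 0.2], 0.7, 0.05)    # too little mass overall
print(good, weak, none)
```

In this toy setting the threshold is 0.4 when the weighted mass near the target quantile is adequate, jumps to 1.0 when that mass is thin, and becomes infinite (vacuous) when the total weighted mass never reaches the target level after subtracting the slack.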