Poster
Off-Policy Confidence Sequences
Nikos Karampatziakis · Paul Mineiro · Aaditya Ramdas
Virtual
Keywords: [ Reinforcement Learning and Planning ] [ Bandits ]
We develop confidence bounds that hold uniformly over time for off-policy evaluation in the contextual bandit setting. These confidence sequences are based on recent ideas from martingale analysis and are non-asymptotic, non-parametric, and valid at arbitrary stopping times. We provide algorithms for computing these confidence sequences that strike a good balance between computational and statistical efficiency. We empirically demonstrate the tightness of our approach in terms of failure probability and width and apply it to the "gated deployment" problem of safely upgrading a production contextual bandit system.
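To make the off-policy evaluation setting concrete, here is a minimal sketch of the standard importance-weighted (IPS) value estimate for a contextual bandit, on hypothetical toy data. This illustrates only the generic estimand that confidence sequences would bound; it is not the paper's martingale-based construction, and all names and the data-generating process are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 2 actions, logged by a uniform behavior policy.
n, k = 10_000, 2
contexts = rng.normal(size=n)
logging_probs = np.full(n, 1.0 / k)            # behavior policy: uniform over actions
actions = rng.integers(0, k, size=n)
# Reward in [0, 1]: the "correct" action for each context earns more on average.
rewards = ((actions == (contexts > 0)).astype(float) * 0.8
           + rng.uniform(0.0, 0.2, size=n))

# Target policy to evaluate: deterministically pick action 1 when context > 0.
target_actions = (contexts > 0).astype(int)

# IPS off-policy estimate of the target policy's value:
#   E[w * r]  with  w = pi_target(a | x) / pi_logging(a | x).
weights = (actions == target_actions) / logging_probs
ips_values = weights * rewards
estimate = ips_values.mean()
print(f"IPS estimate of target policy value: {estimate:.3f}")
```

A confidence sequence for this quantity, as developed in the paper, would provide an interval around such running estimates that is valid simultaneously at every sample size `n`, which is what enables safe, continuously monitored "gated deployment" decisions.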