

Off-Policy Confidence Sequences

Nikos Karampatziakis · Paul Mineiro · Aaditya Ramdas


Keywords: [ Bandits ] [ Reinforcement Learning and Planning ]


We develop confidence bounds that hold uniformly over time for off-policy evaluation in the contextual bandit setting. These confidence sequences are based on recent ideas from martingale analysis and are non-asymptotic, non-parametric, and valid at arbitrary stopping times. We provide algorithms for computing these confidence sequences that strike a good balance between computational and statistical efficiency. We empirically demonstrate the tightness of our approach in terms of failure probability and width and apply it to the "gated deployment" problem of safely upgrading a production contextual bandit system.
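To make the setting concrete, here is a minimal sketch of an anytime-valid (time-uniform) confidence sequence for off-policy evaluation via inverse-propensity-scored (IPS) rewards. This is an illustration only, not the paper's construction: it uses a simple Hoeffding bound combined with a union bound over time, and assumes rewards in [0, 1] and importance weights bounded by a known `w_max`. The policies and data below are synthetic.

```python
import numpy as np

def ips_confidence_sequence(rewards, weights, w_max, alpha=0.05):
    """Conservative anytime-valid confidence sequence for the target
    policy's value, built from importance-weighted (IPS) rewards.

    Uses Hoeffding's inequality at each time t with failure budget
    alpha_t = alpha / (t * (t + 1)); since sum_t alpha_t <= alpha,
    the intervals hold uniformly over all t (valid at any stopping time).
    Assumes rewards in [0, 1] and importance weights in [0, w_max].
    """
    z = np.asarray(rewards, float) * np.asarray(weights, float)  # IPS terms in [0, w_max]
    t = np.arange(1, len(z) + 1)
    mean = np.cumsum(z) / t                       # running IPS estimate
    alpha_t = alpha / (t * (t + 1.0))             # per-time failure budget
    radius = w_max * np.sqrt(np.log(2.0 / alpha_t) / (2.0 * t))
    lo = np.clip(mean - radius, 0.0, w_max)
    hi = np.clip(mean + radius, 0.0, w_max)
    return lo, hi

# Toy logged-bandit data: behavior policy is uniform over 2 actions,
# target policy always plays action 0 (true value = 0.7 here).
rng = np.random.default_rng(0)
n = 2000
actions = rng.integers(0, 2, size=n)                      # behavior draws
rewards = rng.binomial(1, np.where(actions == 0, 0.7, 0.3))
weights = np.where(actions == 0, 2.0, 0.0)                # pi(a|x) / mu(a|x)
lo, hi = ips_confidence_sequence(rewards, weights, w_max=2.0)
```

Because the intervals are valid simultaneously at every `t`, a "gated deployment" check can monitor `lo[t]` continuously and promote the candidate policy the moment its lower bound clears the incumbent's value, without any penalty for repeated peeking. The union-bound boundary here is wider than the martingale-based sequences the paper develops, which is exactly the gap their approach closes.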
