Off-Policy Confidence Sequences

Nikos Karampatziakis · Paul Mineiro · Aaditya Ramdas

[ Abstract ] [ Livestream: Visit Learning Theory 14 ] [ Paper ]

We develop confidence bounds that hold uniformly over time for off-policy evaluation in the contextual bandit setting. These confidence sequences are based on recent ideas from martingale analysis and are non-asymptotic, non-parametric, and valid at arbitrary stopping times. We provide algorithms for computing these confidence sequences that strike a good balance between computational and statistical efficiency. We empirically demonstrate the tightness of our approach in terms of failure probability and width and apply it to the "gated deployment" problem of safely upgrading a production contextual bandit system.
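To illustrate the setting, here is a minimal sketch of off-policy evaluation with an anytime-valid interval. This is not the paper's algorithm: it uses a simple fixed-parameter Hoeffding supermartingale with Ville's inequality on rescaled importance-weighted rewards, which is valid at any stopping time but much looser than the martingale constructions the paper develops. The policies, propensities, and reward model below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a uniform logging policy over K actions with known
# propensities; we want a confidence sequence for the value of a fixed
# (deterministic) target policy.
K, T = 3, 5000
alpha = 0.05                       # overall failure probability
logging_probs = np.full(K, 1.0 / K)
target_action = 0                  # target policy always plays action 0
true_reward = np.array([0.7, 0.4, 0.5])   # Bernoulli reward means (unknown)

w_max = 1.0 / logging_probs.min()  # importance weights lie in [0, w_max]
lam = 1.0                          # fixed "bet size"; trades width vs. horizon

# Simulate logged data and form importance-weighted rewards, rescaled to [0, 1].
actions = rng.integers(0, K, size=T)
rewards = rng.binomial(1, true_reward[actions]).astype(float)
x = (actions == target_action) / logging_probs[actions] * rewards / w_max

s = np.cumsum(x)
t = np.arange(1, T + 1)
mean = s / t
# Hoeffding-style anytime-valid radius: for [0,1]-valued x_i,
# exp(lam*(S_t - t*mu) - t*lam^2/8) is a supermartingale, so by Ville's
# inequality the radius below holds simultaneously for all t with prob 1-alpha.
radius = np.log(2.0 / alpha) / (lam * t) + lam / 8.0
lower = np.clip(mean - radius, 0.0, 1.0) * w_max
upper = np.clip(mean + radius, 0.0, 1.0) * w_max

print(f"after T={T}: target-policy value in [{lower[-1]:.3f}, {upper[-1]:.3f}]")
```

Because the radius holds uniformly over time, the experiment can be stopped adaptively (e.g., as soon as the lower bound clears a safety threshold in a gated deployment) without invalidating the coverage guarantee, unlike a fixed-sample confidence interval.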
