

On the Design of Estimators for Bandit Off-Policy Evaluation

Nikos Vlassis · Aurelien Bibaut · Maria Dimakopoulou · Tony Jebara

Pacific Ballroom #129

Keywords: [ Bandits ]


Off-policy evaluation is the problem of estimating the value of a target policy using data collected under a different policy. Given a base estimator for bandit off-policy evaluation and a parametrized class of control variates, we address the problem of computing a control variate in that class that reduces the risk of the base estimator. We derive the population risk as a function of the class parameters, and we establish conditions that guarantee risk improvement. We present our main results in the context of multi-armed bandits, and we propose a simple design for contextual bandits that gives rise to an estimator that is shown to perform well on multi-class cost-sensitive classification datasets.
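
To make the control-variate idea concrete, here is a minimal sketch for the multi-armed bandit case. It is not the estimator or the parametrized class studied in the paper. It assumes inverse propensity scoring (IPS) as the base estimator and a single scalar coefficient `beta` on the textbook control variate `w - 1`, where `w = target[a] / logging[a]` is the importance weight. This quantity has zero mean under the logging policy as long as the logging policy gives positive probability to every arm the target policy can choose. The function names, the simulated Bernoulli rewards, and the plug-in choice of `beta` are all illustrative assumptions.

```python
import random


def ips_with_control_variate(actions, rewards, target, logging, beta=0.0):
    """Average of w*r - beta*(w - 1), with w = target[a] / logging[a].

    beta = 0 gives plain IPS. For any fixed beta the correction term has
    zero mean under the logging policy, so the estimate stays unbiased.
    """
    total = 0.0
    for a, r in zip(actions, rewards):
        w = target[a] / logging[a]
        total += w * r - beta * (w - 1.0)
    return total / len(actions)


def plug_in_beta(actions, rewards, target, logging):
    """Sample version of Cov(w*r, w) / Var(w) (hypothetical helper).

    Estimating beta from the same data can introduce a small bias;
    this sketch ignores that.
    """
    ws = [target[a] / logging[a] for a in actions]
    ys = [w * r for w, r in zip(ws, rewards)]
    n = len(ws)
    w_bar = sum(ws) / n
    y_bar = sum(ys) / n
    cov = sum((y - y_bar) * (w - w_bar) for y, w in zip(ys, ws))
    var = sum((w - w_bar) ** 2 for w in ws)
    return cov / var if var > 0 else 0.0


def simulate(n, logging, mean_reward, rng):
    """Draw n logged (action, reward) pairs with Bernoulli rewards."""
    actions = rng.choices(range(len(logging)), weights=logging, k=n)
    rewards = [1.0 if rng.random() < mean_reward[a] else 0.0 for a in actions]
    return actions, rewards


if __name__ == "__main__":
    rng = random.Random(0)
    logging = [0.6, 0.3, 0.1]      # policy that collected the data
    target = [0.1, 0.3, 0.6]       # policy we want to evaluate
    mean_reward = [0.2, 0.5, 0.8]  # assumed true arm means (simulation only)

    actions, rewards = simulate(5000, logging, mean_reward, rng)
    beta = plug_in_beta(actions, rewards, target, logging)
    true_value = sum(p * m for p, m in zip(target, mean_reward))

    print("true value:", true_value)
    print("IPS:", ips_with_control_variate(actions, rewards, target, logging))
    print("IPS + control variate:",
          ips_with_control_variate(actions, rewards, target, logging, beta))
```

In this sketch the plug-in `beta` is the least-squares slope of `w*r` on `w`, so on the logged sample it minimizes the empirical variance of the per-sample terms among all scalar coefficients.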
