Improved Estimator Selection for Off-Policy Evaluation
George Tucker

Off-policy policy evaluation is a fundamental problem in reinforcement learning. As a result, many estimators with different tradeoffs have been developed; however, selecting the best estimator is challenging with limited data and without additional interactive data collection. Recently, Su et al. (2020) developed a data-dependent selection procedure that competes with the oracle selection up to a constant and demonstrated its practicality. We refine the analysis to remove an extraneous assumption and improve the procedure. The improved procedure yields a tighter oracle bound and stronger empirical results on a contextual bandit task.

Author Information

George Tucker (Google Brain)