Workshop: Workshop on Reinforcement Learning Theory

Improved Estimator Selection for Off-Policy Evaluation

George Tucker


Off-policy policy evaluation is a fundamental problem in reinforcement learning. As a result, many estimators with different tradeoffs have been developed; however, selecting the best estimator is challenging with limited data and without additional interactive data collection. Recently, Su et al. (2020) developed a data-dependent selection procedure that competes with the oracle selection up to a constant and demonstrated its practicality. We refine the analysis to remove an extraneous assumption and improve the procedure. The improved procedure yields a tighter oracle bound and stronger empirical results on a contextual bandit task.
