Poster in Workshop: Reinforcement Learning for Real Life
Designing Online Advertisements via Bandit and Reinforcement Learning
Richard Liu · Yusuke Narita · Kohei Yata
Abstract:
Efficient methods to evaluate new algorithms are critical for improving reinforcement learning systems such as ad recommendation systems. A/B tests are reliable, but they are costly in time and money and carry a risk of failure. In this paper, we develop a new method of \textit{off-policy evaluation}: predicting the performance of an algorithm given historical data generated by a different algorithm. Our estimator converges in probability to the true value of a counterfactual algorithm at a rate of $\sqrt{N}$. We also show how to correctly estimate the variance of our estimator. In a special case that covers contextual bandits, we show that our estimator achieves the lowest variance among a wide class of estimators. These properties hold even when the analyst does not know which of a large number of state variables are actually important, or when the baseline policy is unknown. We validate our method in a simulation experiment and on real-world data from a major advertisement company, and we apply it to improve an ad policy for that company. We find that our method produces smaller mean squared errors than state-of-the-art methods.
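To make the off-policy evaluation setting concrete, the sketch below shows a standard inverse-propensity-weighting (IPW) estimator for a contextual bandit: logged data from a baseline policy is reweighted to estimate the value of a different, counterfactual policy. This is a textbook baseline, not the estimator proposed in the paper; all names, the synthetic data, and the illustrative target policy are hypothetical.

```python
# Minimal sketch of off-policy evaluation via inverse propensity weighting (IPW).
# This is a standard baseline estimator, NOT the paper's estimator; the data and
# policies below are synthetic and purely illustrative.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical logged data from a baseline (behavior) policy:
# contexts x, chosen actions a, observed rewards r, and the probability with
# which the baseline policy chose each logged action.
n_samples, n_actions, n_features = 10_000, 5, 3
contexts = rng.normal(size=(n_samples, n_features))
logged_actions = rng.integers(0, n_actions, size=n_samples)
behavior_probs = np.full(n_samples, 1.0 / n_actions)        # uniform logging policy
rewards = rng.binomial(1, 0.1 + 0.05 * logged_actions)      # synthetic click rewards

# Illustrative target policy: softmax over a fixed linear score (hypothetical).
policy_weights = rng.normal(size=(n_features, n_actions))

def target_policy_probs(x: np.ndarray, a: np.ndarray) -> np.ndarray:
    """Probability that the hypothetical target policy takes action a in context x."""
    scores = x @ policy_weights
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return probs[np.arange(len(a)), a]

# IPW estimate of the target policy's expected reward:
# weight each logged reward by pi_target(a | x) / pi_behavior(a | x).
weights = target_policy_probs(contexts, logged_actions) / behavior_probs
ipw_value = np.mean(weights * rewards)

# Plug-in standard error under i.i.d. sampling, reflecting the sqrt(N) rate.
ipw_se = np.std(weights * rewards, ddof=1) / np.sqrt(n_samples)
print(f"estimated policy value: {ipw_value:.4f} +/- {1.96 * ipw_se:.4f}")
```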