Poster in Workshop: Reinforcement Learning for Real Life

Designing Online Advertisements via Bandit and Reinforcement Learning

Richard Liu · Yusuke Narita · Kohei Yata


Abstract: Efficient methods to evaluate new algorithms are critical for improving reinforcement learning systems such as ad recommendation systems. A/B tests are reliable, but they are costly in time and money and entail a risk of failure. In this paper, we develop a new method of \textit{off-policy evaluation}, predicting the performance of an algorithm given historical data generated by a different algorithm. Our estimator converges in probability to the true value of a counterfactual algorithm at a rate of $\sqrt{N}$. We also show how to correctly estimate the variance of our estimator. In a special case that covers contextual bandits, we show that our estimator achieves the lowest variance among a wide class of estimators. These properties hold even when the analyst does not know which among a large number of state variables are actually important, or when the baseline policy is unknown. We validate our method with a simulation experiment and on real-world data from a major advertisement company, and we apply it to improve an ad policy for that company. We find that our method produces smaller mean squared errors than state-of-the-art methods.
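To make the contextual-bandit special case concrete, the sketch below shows a standard inverse-propensity-weighting (IPW) off-policy value estimate with a $\sqrt{N}$-rate standard error computed from the sample variance of the per-round terms. This is a minimal illustration of the general idea the abstract describes, not the authors' estimator; the function name, data, and policy probabilities are all hypothetical.

```python
import numpy as np

def ipw_value_estimate(rewards, propensities, target_probs):
    """Hypothetical IPW sketch: estimate a target policy's value from logged bandit data.

    rewards[i]      : observed reward in round i (e.g. an ad click)
    propensities[i] : logging (baseline) policy's probability of the chosen action
    target_probs[i] : target (counterfactual) policy's probability of that same action
    """
    rewards = np.asarray(rewards, dtype=float)
    weights = np.asarray(target_probs, dtype=float) / np.asarray(propensities, dtype=float)
    terms = weights * rewards                    # per-round importance-weighted rewards
    n = len(terms)
    value = terms.mean()                         # point estimate of the policy value
    std_error = terms.std(ddof=1) / np.sqrt(n)   # plug-in standard error (root-N rate)
    return value, std_error

# Synthetic logged data, purely for illustration
rng = np.random.default_rng(0)
n = 10_000
propensities = rng.uniform(0.2, 0.8, size=n)          # logging policy probabilities
target_probs = rng.uniform(0.2, 0.8, size=n)          # counterfactual policy probabilities
rewards = rng.binomial(1, 0.1, size=n)                # binary rewards, e.g. clicks
value, se = ipw_value_estimate(rewards, propensities, target_probs)
print(f"estimated value: {value:.4f} +/- {1.96 * se:.4f}")
```

Plain IPW is unbiased when the logging propensities are known; the abstract's claims about unknown baseline policies, many state variables, and minimum variance go beyond what this simple sketch captures.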
