Poster
Wed Aug 09 01:30 AM -- 05:00 AM (PDT) @ Gallery #136
Counterfactual Data-Fusion for Online Reinforcement Learners
Andrew Forney · Judea Pearl · Elias Bareinboim

The Multi-Armed Bandit problem with Unobserved Confounders (MABUC) considers decision-making settings in which unmeasured variables can influence both the agent's decisions and its received rewards (Bareinboim et al., 2015). Recent findings showed that unobserved confounders (UCs) pose a unique challenge to algorithms based on standard randomization (i.e., experimental data): if UCs are naively averaged out, these algorithms behave sub-optimally, possibly incurring infinite regret. In this paper, we show how counterfactual-based decision-making circumvents these problems and leads to a coherent fusion of observational and experimental data. We then demonstrate this strategy in an enhanced Thompson Sampling bandit player and validate its efficacy with extensive simulations.
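To make the counterfactual (intent-conditioned) flavor of Thompson Sampling concrete, below is a minimal Python sketch. It assumes a hypothetical two-armed MABUC instance in which the agent's naive "intent" is driven by a binary unobserved confounder; the payout table, the per-intent Beta posteriors, and all names (e.g., `true_reward`, `intent`) are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 2  # number of arms in this toy MABUC instance

# Beta posteriors over Bernoulli reward rates, kept separately for each
# "intent" condition (the arm the agent's naive heuristic would pull,
# which serves as a proxy for the unobserved confounder).
successes = np.ones((K, K))  # successes[intent, arm]
failures = np.ones((K, K))   # failures[intent, arm]

def true_reward(arm, uc):
    # Hypothetical payout table: the optimal arm depends on the UC.
    payout = np.array([[0.6, 0.4],
                       [0.3, 0.7]])
    return int(rng.random() < payout[uc, arm])

T = 5000
for t in range(T):
    uc = rng.integers(K)   # unobserved confounder (hidden from the learner)
    intent = uc            # the agent's naive/intuitive choice is driven by the UC

    # Counterfactual step: sample reward rates from the posterior
    # *conditioned on the current intent* and act on the best sample,
    # rather than averaging over the confounder.
    theta = rng.beta(successes[intent], failures[intent])
    arm = int(np.argmax(theta))

    r = true_reward(arm, uc)
    successes[intent, arm] += r
    failures[intent, arm] += 1 - r
```

The key design choice in this sketch is that posteriors are indexed by the intent condition, so the sampler estimates intent-specific (counterfactual) reward rates instead of a single confounded average.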