Counterfactual Data-Fusion for Online Reinforcement Learners
Andrew Forney · Judea Pearl · Elias Bareinboim

Wed Aug 9th 04:24 -- 04:42 PM @ C4.1

The Multi-Armed Bandit problem with Unobserved Confounders (MABUC) considers decision-making settings where unmeasured variables can influence both the agent's decisions and received rewards (Bareinboim et al., 2015). Recent findings showed that unobserved confounders (UCs) pose a unique challenge to algorithms based on standard randomization (i.e., experimental data); if UCs are naively averaged out, these algorithms behave sub-optimally, possibly incurring infinite regret. In this paper, we show how counterfactual-based decision-making circumvents these problems and leads to a coherent fusion of observational and experimental data. We then demonstrate this new strategy in an enhanced Thompson Sampling bandit player, and support its efficacy with extensive simulations.
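
To make the confounding problem described above concrete, here is a minimal sketch in Python. It is not the authors' enhanced algorithm: it is a plain Beta-Bernoulli Thompson Sampling player, run once with arm-level posteriors only (averaging the UC out) and once with posteriors conditioned on the agent's intended arm, in the spirit of the counterfactual approach. The payout table `PAYOUT`, the function `run`, and the assumption that the intended arm is fully determined by the confounder are all hypothetical choices made for illustration.

```python
import random

# Hypothetical payout table (an assumption, not taken from the paper):
# PAYOUT[u][arm] = P(reward = 1 | unobserved confounder u, arm pulled).
PAYOUT = [[0.2, 0.8],
          [0.8, 0.2]]


def run(use_intent, rounds=5000, seed=0):
    """Play a two-armed confounded bandit with Thompson Sampling.

    use_intent=False: one Beta posterior per arm (UC averaged out).
    use_intent=True:  one Beta posterior per (intended arm, arm) pair.
    Returns the average reward per round.
    """
    rng = random.Random(seed)
    n_contexts = 2 if use_intent else 1
    # Beta(1, 1) prior for every (context, arm) pair.
    alpha = [[1, 1] for _ in range(n_contexts)]
    beta = [[1, 1] for _ in range(n_contexts)]
    total = 0
    for _ in range(rounds):
        u = rng.randrange(2)  # unobserved confounder, never shown to the agent
        intent = u            # assumption: the agent's natural choice equals u
        ctx = intent if use_intent else 0
        # Sample a plausible payout rate for each arm and play the best one.
        samples = [rng.betavariate(alpha[ctx][a], beta[ctx][a]) for a in range(2)]
        arm = max(range(2), key=lambda a: samples[a])
        reward = 1 if rng.random() < PAYOUT[u][arm] else 0
        alpha[ctx][arm] += reward
        beta[ctx][arm] += 1 - reward
        total += reward
    return total / rounds


if __name__ == "__main__":
    print("averaging out the UC:", run(use_intent=False))
    print("conditioning on intent:", run(use_intent=True))
```

Under this toy table, both arms pay 0.5 on average once the confounder is marginalized, so the standard player cannot do better than chance, whereas the intent-aware player can learn to pull the arm opposite to its intent and approach a payout rate of 0.8.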

Author Information

Andrew Forney (UCLA)
Judea Pearl (UCLA)
Elias Bareinboim (Purdue)
