Poster
in
Workshop: Foundations of Reinforcement Learning and Control: Connections and Perspectives

Partial Structure Discovery is Sufficient for No-regret Learning in Causal Bandits

Muhammad Qasim Elahi · Mahsa Ghasemi · Murat Kocaoglu

Project Page [ OpenReview]

Abstract

Causal knowledge about the relationships among the decision variables and a reward variable in a bandit setting can accelerate the learning of an optimal decision, as substantiated by studies on causal bandits. Current works often assume the causal graph is known; nevertheless, such knowledge may not be available a priori. Motivated by this challenge, we focus on the causal bandit problem in scenarios where the underlying causal graph is unknown and further, it may include latent confounders. While intervention on the parents of the reward node is optimal in the absence of latent confounders, that is not necessarily the case in general. Instead, one has to consider a set of possibly optimal arms/interventions, each being a special subset of ancestors of the reward node--- deeming causal discovery beyond parents of the reward node imperative. However, for regret minimization, we identify that discovering the full causal structure is not necessary; yet there is no work that provides the necessary and sufficient components of the causal graph. We formally characterize the set of necessary and sufficient latent confounders one needs to detect/learn to ensure all the possibly optimal arms are learned correctly. We also propose a randomized algorithm for learning the causal graph with a limited number of samples and provide a sample complexity guarantee for any desired confidence level. For the causal bandit setup, we propose a two-stage approach. In the first stage, we learn the induced subgraph on ancestors of the reward along with a necessary and sufficient subset of latent confounders to construct the set of possibly optimal arms. The next phase involves the application of a standard bandit algorithm, such as the UCB algorithm. We also establish regret bound for our two-phase approach, which is sublinear in the number of rounds and has polynomial scaling with respect to the number of nodes in the graph

Chat is not available.