Oral

Counterfactual Off-Policy Evaluation with Gumbel-Max Structural Causal Models

Michael Oberst ⋅ David Sontag

2019 Oral


Abstract

We introduce an off-policy evaluation procedure for highlighting episodes where applying a reinforcement-learned (RL) policy is likely to have produced a substantially different outcome than the observed policy. In particular, we introduce a class of structural causal models (SCMs) for generating counterfactual trajectories in finite partially observable Markov decision processes (POMDPs). We see this as a useful procedure for off-policy "debugging" in high-risk settings (e.g., healthcare); by decomposing the expected reward under the RL policy into specific episodes, we can identify groups where it is more likely to dramatically under- or over-perform the observed policy. This in turn can be used to facilitate review of specific episodes by domain experts, as well as to guide data collection (e.g., to characterize patient sub-types). We demonstrate the utility of this procedure in the setting of sepsis management.
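
To make the "Gumbel-Max" idea in the title concrete, the following is a minimal sketch of counterfactual sampling for a single categorical variable, not the authors' implementation. It uses the standard Gumbel-Max trick (an outcome is the argmax of log-probabilities plus independent Gumbel noise): given an observed outcome, it samples noise consistent with that observation and reuses the same noise under a different distribution. The function names `posterior_gumbel_noise` and `counterfactual_outcome` are hypothetical, and the sketch assumes all probabilities are strictly positive.

```python
import numpy as np


def posterior_gumbel_noise(probs, observed, rng):
    """Sample Gumbel noise consistent with `observed` being the argmax.

    Assumes every entry of `probs` is strictly positive.
    """
    logits = np.log(probs)
    # The maximum of the perturbed logits is Gumbel(logsumexp(logits)),
    # independent of which index attains it.
    top = rng.gumbel(loc=np.logaddexp.reduce(logits))
    # Every other perturbed logit is a Gumbel(logit_i) truncated below `top`.
    free = rng.gumbel(loc=logits)
    values = -np.logaddexp(-top, -free)
    values[observed] = top
    # Recover the noise terms by subtracting the logits.
    return values - logits


def counterfactual_outcome(probs_obs, probs_cf, observed, rng):
    """Outcome under `probs_cf`, reusing noise inferred from the observation."""
    probs_obs = np.asarray(probs_obs, dtype=float)
    probs_cf = np.asarray(probs_cf, dtype=float)
    noise = posterior_gumbel_noise(probs_obs, observed, rng)
    return int(np.argmax(np.log(probs_cf) + noise))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p_obs = [0.6, 0.3, 0.1]   # illustrative distribution under the observed policy
    p_cf = [0.2, 0.3, 0.5]    # illustrative distribution under another policy
    draws = [counterfactual_outcome(p_obs, p_cf, 0, rng) for _ in range(5)]
    print(draws)
```

When the counterfactual distribution equals the observed one, the sampled noise reproduces the observed outcome exactly, which is a quick sanity check on the posterior sampling step.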
