Target-Driven Policy Optimization for Sequential Counterfactual Outcome Control
Abstract
Identifying optimal intervention sequences from offline data to guide temporal systems toward target outcomes is a critical challenge with profound implications for fields such as personalized medicine. While existing methods are mostly evaluated in offline settings, practical applications demand online, adaptive strategies that respond in real time. To address this, we propose \textbf{G}oal-conditioned \textbf{I}ntervention via \textbf{F}actual-\textbf{T}argeted Training (\textbf{GIFT}), a novel framework for learning sequential intervention policies from observational data. GIFT learns a goal-conditioned policy by rescaling rewards with clipped importance weights, which stabilizes learning and steers the policy toward the target outcome. Under standard assumptions, the induced operator has a unique fixed point, and our procedure converges to it. We also bound the bias introduced by clipping and approximation in terms of the gap to the policy's true value. Experiments show that GIFT significantly outperforms existing methods at learning goal-conditioned policies for online deployment.
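To fix intuition, one plausible instantiation of the clipped importance-weighted reward rescaling is sketched below; the notation is ours rather than the paper's ($\pi_\theta$ the learned goal-conditioned policy, $\mu$ the behavior policy underlying the offline data, $g$ the target outcome, and $c > 0$ a clipping threshold):
\[
\tilde{r}_t \;=\; \min\!\left(\frac{\pi_\theta(a_t \mid s_t, g)}{\mu(a_t \mid s_t)},\; c\right) r_t .
\]
Under this reading, transitions that are more likely under the goal-conditioned policy than under the behavior policy are upweighted, while clipping the ratio at $c$ bounds the variance of the reweighted returns at the cost of the bias quantified above.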