Active Policy Optimization for Individualized Dosing via Gradient Variance Minimization
Abstract
In domains such as healthcare and marketing, learning optimal individualized dosing policies to maximize utility is crucial. However, high experimental costs impose strict budget constraints, necessitating efficient active policy learning. Existing active learning methods in causal inference primarily focus on binary treatments and effect estimation, leaving continuous dosing and policy optimization underexplored. To address this gap, we propose an active learning framework tailored to optimal policy learning. Exploiting the inherent structure of dose-response curves, we theoretically show that the policy optimization regret is bounded by the expected posterior gradient variance at the estimated optimal doses. Motivated by this result, we introduce Gradient Variance Active Learning for Individualized Dosing (GVALID), a batch acquisition strategy that greedily selects samples to minimize the gradient variance at target doses, enabling efficient policy learning. Experiments demonstrate that GVALID outperforms existing acquisition strategies under strict budget constraints.
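To make the acquisition principle concrete, the following is a minimal illustrative sketch, not the paper's implementation. It assumes a one-dimensional dose, a Gaussian-process dose-response model with an RBF kernel, and a finite-difference approximation of the posterior gradient variance; the function names (`grad_var`, `gvalid_batch`) and all hyperparameters are hypothetical. It relies on the fact that GP posterior variance depends only on input locations, so candidate doses can be scored before their outcomes are observed.

```python
import numpy as np

def rbf(a, b, ls=0.3):
    # RBF kernel matrix between 1-D dose arrays a and b (lengthscale ls is illustrative).
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def post_cov(X, Z, noise=1e-2):
    # GP posterior covariance at test doses Z given design doses X.
    # Note: posterior (co)variance does not depend on observed outcomes.
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Z)
    return rbf(Z, Z) - Ks.T @ np.linalg.solve(K, Ks)

def grad_var(X, x_star, h=1e-3):
    # Posterior variance of the dose-response gradient at x_star,
    # via a central finite difference: Var[(f(x+h) - f(x-h)) / (2h)].
    Z = np.array([x_star - h, x_star + h])
    C = post_cov(X, Z)
    return (C[0, 0] + C[1, 1] - 2 * C[0, 1]) / (4 * h * h)

def gvalid_batch(X, candidates, x_star, batch_size):
    # Greedy batch acquisition: repeatedly add the candidate dose that
    # most reduces the posterior gradient variance at the target dose x_star.
    X = list(X)
    chosen = []
    for _ in range(batch_size):
        best = min(candidates, key=lambda c: grad_var(np.array(X + [c]), x_star))
        chosen.append(best)
        X.append(best)
    return chosen
```

In a full individualized-dosing setting the target would be the estimated optimal dose per covariate profile and the greedy objective an expectation over profiles; the sketch collapses this to a single fixed target dose for clarity.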