Sample-Efficient Full Fine-Tuning of Generative Control Policies
Abstract
Generative control policies (GCPs), such as diffusion- and flow-based control policies, have emerged as effective parameterizations for robot learning. Yet there remains substantial debate over how to fine-tune them sample-efficiently via reinforcement learning. A prevailing view holds that fine-tuning all GCP steps is unnecessary, motivating approaches that fine-tune only a subset of the generative process: either steering the initial noise distribution or learning residual corrections on top of a frozen base policy. In this work, we introduce Off-policy Generative Policy Optimization (\OGPO{}), a sample-efficient algorithm for fine-tuning GCPs that maintains off-policy critic networks to maximize data reuse and propagates policy gradients through the full generative process of the policy via a modified PPO objective, using the critics as the terminal reward. \OGPO{} achieves state-of-the-art performance on manipulation tasks spanning multi-task settings, high-precision insertion, and dexterous control. To our knowledge, it is also the only method that can \emph{fine-tune poorly initialized behavior cloning policies to near-full task success with no expert data in the online replay buffer}, and it does so with \emph{little task-specific hyperparameter tuning}. We perform extensive empirical investigations of \OGPO{}, finding that its superior performance and sample efficiency stem from its ability to learn beyond the action distribution of the pre-trained base policy, and we propose practical implementation details that further boost performance in more complex scenarios.
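The abstract describes the mechanism only at a high level. The sketch below (Python/PyTorch) illustrates one plausible reading of it under stated assumptions, not the authors' implementation: an off-policy critic supplies the terminal reward for a PPO-style clipped surrogate that is re-evaluated at every denoising step, so gradients reach the full generative process rather than only the initial noise. The Gaussian per-step parameterization, the network sizes, all hyperparameters, and the omission of critic training (assumed to happen off-policy from a replay buffer) are illustrative assumptions.

```python
# Minimal sketch of the idea described above, NOT the paper's implementation:
# an off-policy critic provides the terminal reward for a PPO-style clipped
# objective whose gradients reach every denoising step of the generative policy.
# All names, network sizes, and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, K_STEPS = 8, 2, 4   # assumed toy dimensions and number of denoising steps
CLIP_EPS, SIGMA = 0.2, 0.1            # assumed PPO clip range and per-step noise scale


class DenoisePolicy(nn.Module):
    """Predicts the mean of each denoising step from (obs, current noisy action, step index)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM + ACT_DIM + 1, 64), nn.ReLU(), nn.Linear(64, ACT_DIM))

    def step_dist(self, obs, a_k, k):
        t = torch.full_like(a_k[:, :1], float(k) / K_STEPS)
        mean = self.net(torch.cat([obs, a_k, t], dim=-1))
        return torch.distributions.Normal(mean, SIGMA)


class Critic(nn.Module):
    """Q(s, a); assumed to be trained off-policy from replayed transitions (omitted here)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM + ACT_DIM, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def sample_chain(pol, obs):
    """Run the full generative chain from Gaussian noise, storing every intermediate step."""
    a = torch.randn(obs.shape[0], ACT_DIM)
    inputs, outputs, logps = [], [], []
    for k in reversed(range(K_STEPS)):
        dist = pol.step_dist(obs, a, k)
        a_next = dist.sample()
        inputs.append(a); outputs.append(a_next)
        logps.append(dist.log_prob(a_next).sum(-1))
        a = a_next
    return inputs, outputs, torch.stack(logps)           # outputs[-1] is the executed action


policy, old_policy, critic = DenoisePolicy(), DenoisePolicy(), Critic()
old_policy.load_state_dict(policy.state_dict())
pi_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

obs = torch.randn(32, OBS_DIM)                           # stand-in for replayed observations
with torch.no_grad():
    inputs, outputs, old_logps = sample_chain(old_policy, obs)
    advantage = critic(obs, outputs[-1])                 # critic value acts as terminal reward

# Re-evaluate every stored denoising step under the current policy so the clipped
# surrogate propagates gradients through the whole generative process.
new_logps = torch.stack([
    policy.step_dist(obs, a_in, k).log_prob(a_out).sum(-1)
    for k, a_in, a_out in zip(reversed(range(K_STEPS)), inputs, outputs)])

ratio = torch.exp(new_logps - old_logps)
adv = advantage.unsqueeze(0)
loss = -torch.min(ratio * adv, ratio.clamp(1 - CLIP_EPS, 1 + CLIP_EPS) * adv).mean()
pi_opt.zero_grad(); loss.backward(); pi_opt.step()
```

In this reading, data reuse comes from training the critic on replayed transitions, while the clipped ratio over denoising steps keeps the policy update stable even though the advantage signal is an off-policy critic value rather than an on-policy return.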