Direct Flow Q-Learning
Abstract
Flow Matching shows great promise in offline reinforcement learning (RL), yet optimizing these iterative policies via Backpropagation Through Time (BPTT) is unstable. Prevailing paradigms circumvent this by distilling multi-step flows into single-step approximations, but such methods sacrifice the benefits of iterative refinement. To preserve these benefits, we propose Direct Flow Q-Learning (DFQL), a streamlined framework that optimizes flow matching policies without BPTT or distillation. DFQL derives a surrogate objective that directly injects the terminal Q-value gradient as a guidance term into the velocity field at each step, ensuring stable optimization while preserving the expressive capacity of iterative refinement. Across 73 challenging tasks in OGBench and D4RL, DFQL achieves state-of-the-art results. Moreover, DFQL extends seamlessly to the offline-to-online setting, delivering substantial performance gains without further modification.
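To make the abstract's core idea concrete, the sketch below shows one plausible form of such a surrogate objective in PyTorch: the standard flow-matching target is augmented with a detached Q-value gradient, so the policy update never backpropagates through the sampling chain (no BPTT). All names here (`velocity_net`, `q_net`, `lam`, the linear interpolation path) are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def dfql_policy_loss(velocity_net, q_net, obs, actions, lam=1.0):
    """Sketch of a BPTT-free surrogate objective: the flow-matching
    target is shifted by the action-gradient of the terminal Q-value."""
    # Sample a noisy starting point and a random flow time per example.
    x0 = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1)
    xt = (1 - t) * x0 + t * actions            # linear interpolation path
    v_target = actions - x0                    # plain flow-matching target

    # Terminal Q-gradient, computed at the dataset actions and detached,
    # so no gradient flows back through the iterative sampling process.
    a = actions.detach().requires_grad_(True)
    q = q_net(obs, a).sum()
    grad_q = torch.autograd.grad(q, a)[0].detach()

    # Guided target: nudge the velocity field toward higher-Q actions.
    v_pred = velocity_net(obs, xt, t)
    return ((v_pred - (v_target + lam * grad_q)) ** 2).mean()
```

Because the guidance term is detached, each gradient step touches only a single velocity-field evaluation, which is what makes the optimization stable relative to unrolling the full flow.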