What Does Flow-Matching Bring to TD-Learning?
Bhavya Agrawalla ⋅ Michal Nauman ⋅ Aviral Kumar
Abstract
Recent work shows that flow-matching networks can be effective for value function estimation in reinforcement learning, but it remains unclear why they work well or whether flow-matching Q-functions differ fundamentally from standard critics. We show that their success is not explained by distributional RL: explicitly modeling return distributions often degrades performance. Instead, we argue that flow-matching Q-functions are effective because they couple a learned velocity field with an integration procedure that is used both during training and to read out Q-values at inference time. This coupling enables robust value prediction through \emph{test-time recovery} from imperfect intermediate estimates, whose errors are damped out as more integration steps are performed; this mechanism is absent in monolithic critics. Beyond test-time recovery, training with the integration procedure induces more \emph{plastic} representations, allowing critics to track non-stationary TD targets without overwriting previously learned features. We formalize these effects and validate them empirically, showing that flow-matching critics outperform monolithic critics by over $2\times$ and achieve $5$–$10\times$ higher sample efficiency in high update-to-data (UTD) regimes.
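The test-time-recovery mechanism described above can be illustrated with a minimal numerical sketch. The toy velocity field below (a contraction toward a fixed target) and the function names are illustrative assumptions, not the paper's learned network or architecture; the sketch only shows how an Euler-integration readout damps an error injected at an intermediate step.

```python
def flow_q_readout(velocity, x0, steps, horizon, perturb_at=None, perturb=0.0):
    """Read out a scalar value by Euler-integrating a velocity field.

    `velocity(x, t)` plays the role of a learned flow-matching network
    (a hypothetical stand-in). Optionally inject an error at step
    `perturb_at` to probe test-time recovery: subsequent integration
    steps pull the estimate back toward the flow's endpoint.
    """
    x, dt = x0, horizon / steps
    for k in range(steps):
        if perturb_at is not None and k == perturb_at:
            x += perturb  # simulate an imperfect intermediate estimate
        x += dt * velocity(x, k * dt)  # Euler step
    return x

# Toy contracting field: the flow is drawn toward the target value 2.0,
# so an intermediate error shrinks geometrically over the remaining steps.
target = 2.0
v = lambda x, t: target - x

q_clean = flow_q_readout(v, x0=0.0, steps=100, horizon=5.0)
q_perturbed = flow_q_readout(v, x0=0.0, steps=100, horizon=5.0,
                             perturb_at=50, perturb=1.0)
```

With these (assumed) dynamics, a unit-size error injected halfway through integration leaves only a small residual in the final estimate, whereas a monolithic critic that emits its prediction in a single forward pass has no analogous correction steps.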