iGRPO: Fast Online RL for Flow Matching Models with Dense Rewards
Abstract
Conventional practice assumes that online reinforcement learning for flow-matching models requires sampling full denoising trajectories to compute rewards. This assumption underlies methods such as Group Relative Policy Optimization (GRPO), where the policy must traverse the entire reverse process before receiving a delayed, trajectory-level reward. We observe, however, that while such terminal rewards provide feedback, they are neither necessary nor optimal for effective learning. In this work, we introduce iGRPO (Instant-reward GRPO), which replaces GRPO's full-trajectory rollouts with a single-step mapping that assigns rewards instantly at each denoising step. Because flow-matching models behave differently across timesteps, our step-local instant rewards, which are inherently time-dependent, address a key limitation of prior approaches that rely on a single, time-independent terminal reward. By evaluating each action locally rather than relying on a final terminal score, iGRPO eliminates the need for multi-step SDE rollouts and offers more precise credit assignment. Across standard benchmarks, iGRPO converges 10.2× faster than FlowGRPO while achieving higher final alignment quality. We hope this work motivates more efficient and scalable online RL methods for flow-matching generative models.
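As a rough illustration of the idea described above, the sketch below shows a step-local instant reward combined with the usual group-relative advantage normalization, in contrast to scoring only the terminal sample of a full reverse rollout. The names `velocity_model` and `reward_fn`, the assumed interpolation convention, and the single-step clean estimate are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def group_relative_advantage(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Standard GRPO-style normalization: center and scale rewards within a group.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def igrpo_step_advantages(x_t: torch.Tensor, t: float, velocity_model, reward_fn) -> torch.Tensor:
    # Hypothetical step-local (instant) reward at denoising time t: instead of
    # rolling out the full reverse SDE and scoring only the terminal sample,
    # each group member is scored immediately via a single-step mapping from
    # the current noisy state x_t to a clean estimate x0_hat.
    #
    # Assumed flow-matching convention: x_t = (1 - t) * x0 + t * noise, so the
    # velocity v ~ noise - x0 and x0_hat = x_t - t * v (one Euler-style step).
    v = velocity_model(x_t, t)        # predicted velocity field at time t
    x0_hat = x_t - t * v              # single-step clean estimate (illustrative)
    rewards = reward_fn(x0_hat)       # time-dependent instant reward, one per group member
    return group_relative_advantage(rewards)
```

Because the reward is available at every denoising step rather than only at the end of the trajectory, each action can be credited locally, which is the property the abstract attributes to iGRPO.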