Poster Wed, Jul 8, 2026 • 1:00 AM – 2:45 AM PDT HALL A #1409

Alleviating Sparse Rewards by Modeling Step-Wise and Long-Term Sampling Effects in Flow-Based GRPO

Yunze Tong ⋅ Mushui Liu ⋅ Canyu Zhao ⋅ Wanggui He ⋅ Shiyi Zhang ⋅ Peng Zhang ⋅ Hongwei Zhang ⋅ Jinlong Liu ⋅ Hao Jiang

Project Page

Abstract

Deploying GRPO on Flow Matching models has proven effective for text-to-image generation. However, existing paradigms typically propagate an outcome-based reward to all preceding denoising steps without distinguishing the local effect of each step. Moreover, current group-wise ranking mainly compares trajectories at matched timesteps and ignores within-trajectory dependencies, where certain early denoising actions can affect later states via delayed, implicit interactions. We propose TurningPoint-GRPO (TP-GRPO), a GRPO framework that alleviates step-wise reward sparsity and explicitly models long-term effects within the denoising trajectory. TP-GRPO makes two key innovations: (i) it replaces outcome-based rewards with step-level incremental rewards, providing a dense, step-aware learning signal that better isolates each denoising action’s ``pure" effect, and (ii) it identifies turning points—steps that flip the local reward trend and make subsequent reward evolution consistent with the overall trajectory trend—and assigns these actions an aggregated long-term reward to capture their delayed impact. Turning points are detected solely via sign changes in incremental rewards, making TP-GRPO efficient and hyperparameter-free. Extensive experiments also demonstrate that TP-GRPO exploits reward signals more effectively and consistently improves generation.