Test-Time Reinforcement Learning for Flow Matching
Abstract
Flow matching has emerged as a leading framework for high-fidelity text-to-image generation. However, aligning it with human preferences through reinforcement learning (RL) is often hindered by substantial computational overhead. In this paper, we introduce Flow-TTRL, the first test-time reinforcement learning framework that achieves alignment on the fly. Our approach reinterprets intermediate latent representations as an implicit policy and uses SDE-based rollouts to explore high-reward trajectories within the learned vector field. Specifically, we propose a two-stage optimization strategy: Proximal Reward Difference Prediction (PRDP) ensures structural stability in high-noise regimes through pairwise reward regression, while Group Relative Policy Optimization (GRPO) refines fine-grained aesthetic details by maximizing relative advantages within sampled candidate groups. Experimental results show that Flow-TTRL significantly improves aesthetic quality, text-image alignment, and human preference scores across diverse backbones. On the GenEval benchmark, Flow-TTRL raises the accuracy of SD 3.5-Medium from 63\% to 87\% and that of Flux.1 Dev from 66\% to 83\%. Furthermore, our framework achieves an average gain of 15\% to 20\% across T2I-CompBench metrics, delivering performance comparable to state-of-the-art RL-based fine-tuning methods without any additional fine-tuning.
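
To make the two-stage objective concrete, the following is a minimal sketch (not the authors' implementation) of how the GRPO-style group-relative advantages and a PRDP-style pairwise reward-difference loss could be computed over a group of SDE-rollout candidates; the names `reward_fn`, `group_relative_advantages`, and `pairwise_reward_difference_loss`, as well as the group size, are illustrative assumptions.

```python
# Illustrative sketch of the two-stage objective described in the abstract.
# All function and variable names are hypothetical; this is not the paper's code.
import torch


def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages: standardize each candidate's reward within its group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def pairwise_reward_difference_loss(pred_rewards: torch.Tensor,
                                    true_rewards: torch.Tensor) -> torch.Tensor:
    """PRDP-style objective: regress predicted reward gaps onto observed reward gaps
    for every pair of candidates in the group."""
    pred_diff = pred_rewards.unsqueeze(0) - pred_rewards.unsqueeze(1)
    true_diff = true_rewards.unsqueeze(0) - true_rewards.unsqueeze(1)
    return ((pred_diff - true_diff) ** 2).mean()


# Usage with a hypothetical group of G candidates scored by some reward model:
G = 8
true_rewards = torch.randn(G)      # rewards of G SDE-rollout candidates (placeholder values)
pred_rewards = torch.randn(G)      # model-predicted rewards for the same candidates
advantages = group_relative_advantages(true_rewards)
prdp_loss = pairwise_reward_difference_loss(pred_rewards, true_rewards)
```

In this reading, the PRDP term supplies a stable learning signal in high-noise regimes by comparing candidates pairwise, while the standardized advantages drive the GRPO refinement toward candidates that outperform their group mean.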