Benchmarking and Evolving Reason-Reflect-Rectify for Reflective Visual Generation
Junjie Wang ⋅ Xinghua Lou ⋅ Xiangtai Li ⋅ Ye Tian ⋅ Keyu Chen ⋅ Yulin Li ⋅ Bin Kang ⋅ Guangcan Mai ⋅ Yanwei Li ⋅ Zhuotao Tian ⋅ Liqiang Nie
Abstract
Text-to-Image (T2I) models and Unified Multimodal Models (UMMs) have achieved remarkable progress in visual generation. However, their reliance on a single-pass generation paradigm limits their ability to handle complex prompts requiring iterative refinement. To enable multi-round Reflective Visual Generation (RVG), we formalize the Reason-Reflect-Rectify (R$^3$) loop as a core framework and introduce R$^3$-Bench, a benchmark of over 600 expert-annotated instances that quantifies iterative reasoning and rectification capabilities. Evaluation on R$^3$-Bench reveals a critical gap: while state-of-the-art models can identify generation errors, they fail to generate actionable rectification instructions. To bridge this gap, we propose R$^3$-Refiner, a dual-stage framework leveraging Group Relative Policy Optimization (GRPO) and a Hierarchical Reward Mechanism (HRM) to better align rectification with reflective reasoning. Experiments show that R$^3$-Refiner achieves significant improvements on R$^3$-Bench (+12.0% in Reflective Verdict Score, +9.0% in Rectification Score), and can be seamlessly integrated with various MLLMs to enhance the generation quality of different T2I models on GenEval++ and T2I-CompBench. Our code, models, and data will be publicly available.
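To make the R$^3$ loop concrete, the following is a minimal sketch of the generate-reflect-rectify cycle the abstract describes. All function names (`generate_image`, `reflect`, `rectify_prompt`), their signatures, and the stopping criterion are hypothetical placeholders for illustration, not the authors' actual implementation or API.

```python
# Minimal, illustrative sketch of a Reason-Reflect-Rectify (R^3) loop.
# Each placeholder would be backed by a real model in practice:
#   generate_image -> a T2I model, reflect/rectify_prompt -> an MLLM.

def generate_image(prompt: str) -> str:
    """Placeholder T2I call: returns an identifier for the generated image."""
    return f"image({prompt})"

def reflect(prompt: str, image: str) -> tuple[bool, str]:
    """Placeholder reflection step: judge whether the image satisfies
    the prompt and, if not, produce a critique of the mismatch."""
    return True, ""  # (is_satisfactory, critique)

def rectify_prompt(prompt: str, critique: str) -> str:
    """Placeholder rectification step: turn the critique into an
    actionable, revised instruction for the next generation round."""
    return f"{prompt} | fix: {critique}"

def r3_loop(prompt: str, max_rounds: int = 3) -> str:
    """Iterate generate -> reflect -> rectify until the reflection
    step accepts the image or the round budget is exhausted."""
    image = generate_image(prompt)
    for _ in range(max_rounds):
        ok, critique = reflect(prompt, image)
        if ok:
            break
        prompt = rectify_prompt(prompt, critique)
        image = generate_image(prompt)
    return image

if __name__ == "__main__":
    print(r3_loop("a red cube on top of a blue sphere"))
```

The key design point the abstract emphasizes is the middle two steps: reflection alone (detecting the error) is insufficient; the rectification step must convert the critique into an instruction the generator can act on in the next round.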