FineFocus: Benchmarking and Improving Fine-Grained Text-to-Image Alignment via Paired Reinforcement Learning
Abstract
While recent autoregressive models have achieved text-to-image generation performance comparable to that of diffusion models, they struggle significantly with fine-grained semantic alignment. To rigorously evaluate this limitation, we introduce DeltaBench, a benchmark featuring paired prompts with subtle fine-grained differences, which reveals that existing models fail to achieve precise control over visual tokens. To bridge this gap, we propose FineFocus, a comprehensive framework that enhances alignment by learning from subtle differences between similar text-image pairs. Specifically, we construct FineFocus-Data, a large-scale dataset of paired samples derived from image editing tasks to capture localized semantic shifts. Furthermore, we introduce Pair-GRPO, an improved reinforcement learning algorithm that extends GRPO to paired samples. Extensive experiments demonstrate that our approach outperforms most prominent prior methods on both DeltaBench and existing benchmarks.