Preference-Enhanced Reinforcement Learning for Pluralistic Image Inpainting
Abstract
Existing image inpainting frameworks are trained under strictly supervised paradigms whose over-reliance on ground-truth reconstruction yields conservative outputs, with creativity poorly aligned to user intent and limited diversity. To address this, we propose the first framework to explore Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO) for text-guided image inpainting, formulating an efficient online reinforcement learning pipeline that enables flexible, human-aligned aesthetic control via a preference scoring model. Crucially, by decoupling the rigid one-to-one correspondence between text prompts and masked images, our method allows the model to explore diverse, controllable, and high-quality solutions beyond a single ground-truth target. Furthermore, to balance semantic consistency with physical naturalness at mask boundaries, we introduce a scale-aware dynamic reward mechanism that adaptively emphasizes boundary gradient coherence for small occlusions while prioritizing visual aesthetics in large-scale generation. Extensive experiments demonstrate that our approach consistently produces higher-quality results across different backbone architectures, including Stable Diffusion and FLUX, significantly enhancing the generative capacity of the base models. Code is available at https://anonymous.4open.science/r/E3F47R.
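To make the scale-aware reward and the group-relative training signal concrete, the sketch below shows one plausible implementation: a reward that shifts weight from boundary gradient coherence to aesthetic score as the mask grows, and the standard GRPO advantage normalization over a group of candidates sampled for the same (prompt, masked image) pair. The function names (`scale_aware_reward`, `grpo_advantages`), the sigmoid schedule, and the threshold `tau` are illustrative assumptions, not details taken from the paper.

```python
import torch

def scale_aware_reward(aesthetic_score: torch.Tensor,
                       boundary_grad_coherence: torch.Tensor,
                       mask_area_ratio: torch.Tensor,
                       tau: float = 0.25) -> torch.Tensor:
    """Blend two reward terms with a weight driven by mask scale.

    Small masks (area ratio below ~tau) emphasize gradient coherence at
    the mask boundary; large masks shift weight toward the aesthetic
    (preference-model) score. The sigmoid schedule and `tau` are
    illustrative choices, not the paper's reported values.
    """
    # w -> 1 as the mask grows, w -> 0 for small occlusions
    w = torch.sigmoid((mask_area_ratio - tau) / 0.05)
    return w * aesthetic_score + (1.0 - w) * boundary_grad_coherence

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: normalize each candidate's reward
    against the group sampled for the same (prompt, masked image)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: G = 4 inpainting candidates for one masked image covering 12%
# of the canvas; scores here are placeholders for model outputs.
aes = torch.tensor([0.62, 0.71, 0.55, 0.80])
grad = torch.tensor([0.90, 0.84, 0.88, 0.79])
rewards = scale_aware_reward(aes, grad, torch.tensor(0.12))
advantages = grpo_advantages(rewards)
```

In this sketch the small mask ratio (0.12) keeps `w` near zero, so boundary coherence dominates the reward, matching the abstract's claim that small occlusions are scored primarily on boundary gradient coherence.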