B-Spar: Bayesian Sparse-Reward Modeling for RL-based Image Editing
Abstract
Autonomous image-editing agents powered by multimodal large language models (MLLMs) improve transparency and controllability by translating high-level instructions into tool-mediated edit sequences, but training such agents with reinforcement learning often relies on dense proxy rewards (e.g., incremental image-quality score gains) to compensate for sparse human feedback. When these proxies overvalue small local changes, the resulting optimization signal can be dominated by numerically measurable yet perceptually negligible edits, biasing policy gradients toward proxy artifacts rather than meaningful progress. We propose B-Spar, a reward-centric reinforcement learning framework for perceptually aligned image retouching under sparse feedback. B-Spar combines prior-guided trajectory sampling to reduce inefficient exploration, Bayesian reward modeling to densify sparse binary feedback into a stable training signal, and anchor-regularized policy optimization to steer updates toward high-reward regions while preventing early mode collapse. Experiments on public benchmarks demonstrate that B-Spar improves perceptual quality and metric alignment over strong prompt-based and training-based baselines, while maintaining stable training and competitive inference efficiency. Notably, it outperforms AIGC-based baselines by over 95\% in perceptual quality and achieves an improvement of approximately 33.5\% over the state-of-the-art.