Escaping the Likelihood Trap: Geometric Diversity Optimization for Long-Form Image Captioning
Abstract
The utility of Vision-Language Models (VLMs) in reasoning and auditing tasks hinges on their ability to exhaustively describe visual scenes. However, current models exhibit a pathology we term the Likelihood Trap: standard alignment objectives, specifically maximum likelihood estimation (MLE) and KL regularization, drive generation toward generic, high-probability templates, systematically suppressing fine-grained details. To overcome this, we introduce Geo-RL, a framework that shifts the objective from probabilistic likelihood to geometric coverage. Geo-RL reformulates caption generation as maximizing the volume of a parallelotope in semantic space. By leveraging Determinantal Point Processes (DPPs), we enforce orthogonality among sampled descriptions, ensuring that they span the image's full semantic support. Crucially, we derive a closed-form leave-one-out marginal reward, enabling stable policy optimization. Empirically, Geo-RL escapes the trap, achieving significant improvements in semantic richness and detail coverage without compromising visual grounding.
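To make the geometric objective concrete, the following is a minimal sketch of a DPP-style leave-one-out reward. It assumes each sampled caption has already been mapped to an L2-normalized embedding; the helper name `diversity_rewards` and the jitter constant are illustrative choices, not the paper's released implementation. The squared parallelotope volume spanned by the embeddings is the determinant of their Gram (L-ensemble) kernel, and each caption's reward is its marginal contribution to the log-volume when it is left out.

```python
import numpy as np

def diversity_rewards(E):
    """Leave-one-out log-volume rewards for a set of caption embeddings.

    E: (n, d) array of L2-normalized caption embeddings.
    The squared volume of the parallelotope spanned by the rows is
    det(E @ E.T); each caption's reward is the drop in log-volume when
    it is removed (a DPP-style marginal credit assignment).
    """
    n = E.shape[0]
    # Gram kernel of the embeddings, with a small jitter so the
    # log-determinant stays finite for near-duplicate captions.
    L = E @ E.T + 1e-6 * np.eye(n)
    _, full_logdet = np.linalg.slogdet(L)
    rewards = np.empty(n)
    for i in range(n):
        keep = np.delete(np.arange(n), i)
        _, loo_logdet = np.linalg.slogdet(L[np.ix_(keep, keep)])
        rewards[i] = full_logdet - loo_logdet  # marginal log-volume of caption i
    return rewards
```

Under this reward, mutually orthogonal embeddings each contribute a marginal log-volume near zero (the maximum for unit vectors), while a caption that duplicates another adds almost no volume and receives a strongly negative reward, which is the pressure toward spanning the image's semantic support.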