Don't Overthink with Pixels: Efficient Reasoning for Segmentation
Abstract
Existing reasoning segmentation approaches typically fine-tune multimodal large language models (MLLMs) on image-text pairs with corresponding mask labels. While recent efforts leverage reinforcement fine-tuning to further enhance reasoning ability, they often suffer from overthinking, producing uniformly verbose reasoning chains regardless of task complexity. To address this problem, we propose PixelThink, a simple yet effective scheme that integrates externally estimated task difficulty with internally measured model uncertainty to regulate reasoning generation within a reinforcement learning paradigm. The model learns to adapt its reasoning length to scene complexity and predictive confidence. To support comprehensive evaluation, we introduce ReasonSeg-Diff, an extended benchmark with annotated reasoning references and difficulty scores, along with a suite of metrics that jointly assess segmentation accuracy, reasoning quality, and efficiency. Experimental results demonstrate that the proposed approach not only improves segmentation performance but also reduces inference latency by 30.4% and token usage by 48.2%. Code and models will be made publicly available.
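To make the length-regulation idea concrete, the sketch below illustrates one way a per-sample token budget, conditioned on difficulty and uncertainty, could shape an RL length penalty. This is not the paper's actual reward: the function name `budgeted_length_reward`, the budget constants, and the tanh penalty form are all illustrative assumptions; only the premise that difficulty and uncertainty (here assumed as scalars in [0, 1]) modulate allowed reasoning length comes from the abstract.

```python
import math

def budgeted_length_reward(num_tokens: int,
                           difficulty: float,
                           uncertainty: float,
                           base_budget: int = 512,
                           min_budget: int = 32) -> float:
    """Reward term that discourages reasoning beyond a per-sample budget.

    The token budget grows with externally estimated task difficulty and
    internally measured model uncertainty (both assumed in [0, 1]), so easy,
    confident samples are pushed toward short chains while hard or uncertain
    ones are allowed longer reasoning.
    """
    # Interpolate the budget between min_budget and base_budget:
    # a sample that is hard OR uncertain receives a larger budget.
    scale = max(difficulty, uncertainty)
    budget = min_budget + scale * (base_budget - min_budget)

    overshoot = max(0.0, num_tokens - budget)
    if overshoot == 0.0:
        return 0.0  # within budget: no length penalty
    # Smooth, bounded penalty in (-1, 0) for exceeding the budget.
    return -math.tanh(overshoot / budget)

# Example: a 300-token chain on an easy, confident sample is penalized,
# while the same chain on a hard sample stays within its larger budget.
print(budgeted_length_reward(300, difficulty=0.1, uncertainty=0.2))  # ~ -0.87
print(budgeted_length_reward(300, difficulty=0.9, uncertainty=0.1))  # 0.0
```

In an RL fine-tuning loop, a term like this would be added to the task reward (e.g., segmentation accuracy), so the policy trades verbosity against accuracy only where the budget allows.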