Regulating Anatomy-Aware Rewards via Trajectory-Integral Feedback for Volumetric Computed Tomography Analysis
Abstract
The advancement of Medical Vision-Language Models (VLMs) for 3D Computed Tomography (CT) analysis is hindered by a misalignment between optimization objectives and clinical rigor. Current Reinforcement Learning (RL) paradigms rely on lexical proxy signals that induce ``\textbf{evaluation hallucinations}'', where models prioritize linguistic fluency over factual accuracy, leading to fatal clinical errors. To bridge this gap, we introduce the \textbf{Clinical Abnormality Benchmarking Substrate (CABS)}, a structured system that decomposes radiology reports into verifiable semantic units. Using CABS, we identify a ``\textbf{mechanistic divergence}'' in standard RL, where surface-similarity rewards drive policy gradients to bypass medical facts. We therefore propose \textbf{Trajectory-Integral Feedback GRPO (TIF-GRPO)}, a novel framework that integrates control-theoretic principles into policy optimization. By formulating clinical reasoning as a pseudo-temporal trajectory for anomaly discovery, TIF-GRPO regulates anatomy-aware rewards via an integral feedback loop that penalizes persistent omissions as cumulative state errors and suppresses hallucinations as excessive control effort. Experiments on 3D CT benchmarks demonstrate that our approach significantly improves abnormality detection and clinical faithfulness, establishing a new paradigm for fine-grained regulation in medical VLMs.
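The integral-feedback idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function \texttt{tif\_reward}, the gain parameters \texttt{k\_i} and \texttt{k\_e}, and the set-based representation of findings are all illustrative assumptions. Each reasoning step is treated as a pseudo-time step; omissions that persist across steps accumulate as an integral error term, while hallucinated findings are charged as control effort.

```python
# Hypothetical sketch (NOT the paper's method): integral-feedback reward
# shaping in the spirit described in the abstract. Findings are modeled as
# string sets; "steps" is the sequence of findings mentioned at each
# pseudo-time step of the reasoning trajectory.

def tif_reward(steps, gt_findings, k_i=0.5, k_e=0.3):
    """Return a shaped reward: coverage minus integral omission error
    minus hallucination (control-effort) penalty. Gains are illustrative."""
    integral_error = 0.0   # accumulated count of still-missing true findings
    effort = 0.0           # accumulated count of hallucinated findings
    covered = set()        # true findings recovered so far along the trajectory
    for mentioned in steps:
        covered |= mentioned & gt_findings            # recover true findings
        integral_error += len(gt_findings - covered)  # persistent omissions integrate
        effort += len(mentioned - gt_findings)        # hallucinations cost effort
    coverage = len(covered) / max(len(gt_findings), 1)
    return (coverage
            - k_i * integral_error / max(len(steps), 1)
            - k_e * effort)
```

Under these assumed gains, a trajectory that eventually covers all ground-truth findings without fabricating any scores higher than one that adds a hallucinated finding, and higher still than one that never mentions a finding at all, since the omission keeps integrating at every step.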