Optimizing Inference-Time Compute for Medical Reasoning via Uncertainty Quantification
Shaohao Rui ⋅ Kaitao Chen ⋅ Weijie Ma ⋅ Xiaosong Wang
Abstract
Extended Chain-of-Thought (CoT) reasoning has significantly bolstered the capabilities of medical large language models (LLMs). However, current models exhibit static computational expenditure, applying lengthy reasoning processes indiscriminately to both simple queries and complex diagnostic cases. This inefficiency is particularly prohibitive in real-world healthcare, where clinical scenarios range from time-sensitive emergencies requiring rapid response to intricate pathologies demanding deep analysis. To address this, we propose **AdaThink-Med**, an end-to-end framework for adaptive reasoning via uncertainty-guided length calibration. Although the underlying mechanism is generalizable, we demonstrate its critical value in the medical domain, where balancing inference latency with diagnostic precision is paramount. AdaThink-Med leverages entropy-based uncertainty estimation within reinforcement fine-tuning to dynamically shape reward signals: it penalizes verbosity for high-confidence correct answers (e.g., straightforward knowledge retrieval) while incentivizing extended exploration for uncertain or ambiguous scenarios. Across six medical benchmarks, AdaThink-Med reduces inference token consumption by $4.7\times$ and $6.4\times$ on Qwen and Llama architectures, respectively, with minimal performance trade-offs. Notably, the model spontaneously develops distinct "non-thinking" and "thinking" modes, demonstrating an autonomous ability to allocate computational resources efficiently based on clinical urgency and complexity.
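The abstract describes a reward that penalizes verbosity when the model is confident and correct, but waives the penalty when uncertainty is high. The paper does not give the exact reward formula here; the following is a minimal illustrative sketch, where `token_entropy`, `length_shaped_reward`, the threshold, and the penalty coefficient `alpha` are all hypothetical names and values chosen for illustration, not the authors' implementation.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def length_shaped_reward(correct, mean_entropy, n_tokens,
                         entropy_threshold=1.0, max_tokens=2048, alpha=0.5):
    """Illustrative uncertainty-guided reward shaping (assumed form):
    - correct answer earns a base reward of 1.0;
    - if the model was also confident (low mean token entropy),
      subtract a penalty proportional to response length, discouraging
      long chains of thought on easy, high-confidence queries;
    - if uncertainty is high, apply no length penalty, leaving room
      for extended exploration on ambiguous cases."""
    base = 1.0 if correct else 0.0
    if correct and mean_entropy < entropy_threshold:
        base -= alpha * (n_tokens / max_tokens)
    return base
```

Under this shaping, a confident correct answer of 1024 tokens scores `1.0 - 0.5 * (1024 / 2048) = 0.75`, while an equally long but high-entropy correct answer keeps the full reward of `1.0`, so the policy is pushed toward short outputs only where it is already certain.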