Confidence is Not Universal: Task-Dependent Calibration and Emergent Behavior in LLMs
Abstract
Large language models (LLMs) increasingly support human decision-making, making human-interpretable confidence estimates essential. However, it remains unclear whether verbalized confidence calibration generalizes across heterogeneous tasks without degrading accuracy. We show that universal confidence calibration fails: across diverse benchmarks, we identify two incompatible task families with distinct confidence semantics. In reasoning-centric tasks, confidence supervision transfers within the family, often improving calibration while preserving or even improving accuracy, and it induces emergent behaviors such as confidence-dependent reasoning length and self-verification. Retrieval- and copy-oriented tasks also exhibit within-family transfer but fail to generalize to reasoning tasks; cross-family supervision degrades both calibration and accuracy. Motivated by this finding, we disentangle confidence into reasoning uncertainty and evidence-localization uncertainty. This simple decomposition restores cross-family generalization with supervised fine-tuning alone, suggesting that effective confidence alignment requires task-aware confidence semantics rather than a single universal scalar.