Disentangling a Large Language Model’s Computation from its Chain-of-Thought
Abstract
Do the chains of thought (CoT) of reasoning Large Language Models (LLMs) reflect their internal computation? In this paper, we provide evidence of \textit{performative} CoT, where a model becomes strongly confident in its final answer but continues generating excess tokens without revealing its internal belief. Our analysis compares activation probing of the model's final answer and early forced answering against a CoT monitor across two large reasoning models (DeepSeek-R1 671B and GPT-OSS 120B). We observe difficulty-specific differences between these methods: the gap between the expressed CoT and the model's internal belief is larger for easier, recall-based MMLU-Redux questions and smaller for the more difficult, multi-hop GPQA-Diamond questions. We also study inflection points within individual reasoning traces, finding that they correspond to updates in probe confidence. Finally, we leverage our probes to enable confidence-based early exit from the CoT, saving up to 80\% of tokens on MMLU and 30\% of tokens on GPQA while maintaining similar accuracy. This work adds nuance to discussions of CoT faithfulness and establishes attention probing as an efficient method both for detecting performative reasoning and for adaptive computation in reasoning LLMs.
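To make the early-exit mechanism summarized above concrete, the following minimal sketch shows how a probe over hidden activations could gate a confidence-based exit from a CoT. It is illustrative only: a simple linear probe on synthetic activations stands in for the paper's attention probe on a full reasoning model, and the probe weights, check interval, and confidence threshold are assumed values, not the paper's implementation.

```python
# Minimal, illustrative sketch of confidence-based early exit from a CoT.
# A hypothetical linear answer probe is applied to (synthetic) hidden states
# at fixed intervals; generation stops once probe confidence crosses a threshold.
import numpy as np

rng = np.random.default_rng(0)

HIDDEN_DIM = 64        # width of the synthetic residual-stream activations
NUM_CHOICES = 4        # multiple-choice answer options (e.g., A-D)
CHECK_EVERY = 16       # probe the activations every N generated CoT tokens
CONF_THRESHOLD = 0.9   # exit once the probe's max softmax probability exceeds this

# Hypothetical pre-trained linear probe: hidden state -> answer logits.
probe_W = rng.normal(size=(HIDDEN_DIM, NUM_CHOICES))
probe_b = np.zeros(NUM_CHOICES)

def probe_confidence(hidden_state):
    """Return (predicted answer index, softmax confidence) for one hidden state."""
    logits = hidden_state @ probe_W + probe_b
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(probs.argmax()), float(probs.max())

def generate_cot_with_early_exit(get_hidden_state, max_tokens=1024):
    """Simulate CoT generation, exiting early when probe confidence is high enough."""
    h = get_hidden_state(0)
    for t in range(1, max_tokens + 1):
        h = get_hidden_state(t)              # activation after generating token t
        if t % CHECK_EVERY == 0:
            answer, conf = probe_confidence(h)
            if conf >= CONF_THRESHOLD:
                return answer, t             # early exit: stop the CoT here
    return probe_confidence(h)[0], max_tokens  # fall back to the full-length CoT

# Synthetic stand-in for a model: activations drift toward a fixed "belief" direction,
# mimicking a model that becomes internally confident partway through its CoT.
belief = rng.normal(size=HIDDEN_DIM)
def fake_hidden_state(t, noise=1.0):
    return belief * min(t / 200.0, 1.0) * 5.0 + rng.normal(scale=noise, size=HIDDEN_DIM)

answer, used_tokens = generate_cot_with_early_exit(fake_hidden_state)
print(f"probe answer={answer}, tokens used={used_tokens} (budget 1024)")
```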