Overthinking: Amplifying Reasoning Weights to Extract Learned Secrets
Jack Hopkins ⋅ Dipika Khullar ⋅ Fabien Roger
Abstract
Black-box auditing of language models is an essential pre-deployment tool, but it may miss subtle forms of misalignment and hidden information. To better elicit hidden information during an audit, we introduce *overthinking*: using reasoning task vectors to amplify the chain-of-thought faithfulness of reasoning models. Given the parameters of a base instruct model $M$ and a reasoning-distilled model $R$, we define the *overthinking model* as $\mathcal{O}_\alpha = M + \alpha(R - M)$, where $\alpha > 1$ amplifies reasoning beyond the pure reasoning model $R$. We also introduce layer-wise attenuation strategies that selectively amplify reasoning without sacrificing the quality and coherence of model outputs. We demonstrate that overthinking models are more likely to reveal hidden information across four experimental settings and across models ranging from 2B to 32B parameters. Our findings suggest that reasoning amplification can surface secrets or unintended behaviors acquired during training up to 10 times more frequently than the original reasoning model.
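To make the construction concrete, the following is a minimal sketch of the weight-space extrapolation $\mathcal{O}_\alpha = M + \alpha(R - M)$, assuming HuggingFace-style checkpoints where $M$ and $R$ share the same architecture and parameter names. The function name and checkpoint arguments are illustrative, not the paper's released code, and the layer-wise attenuation strategies are omitted here.

```python
# Sketch of the overthinking construction O_alpha = M + alpha * (R - M),
# assuming base and reasoning checkpoints share architecture and parameter names.
import torch
from transformers import AutoModelForCausalLM

def build_overthinking_model(base_name: str, reasoning_name: str, alpha: float = 1.5):
    """Return a model whose weights extrapolate past the reasoning model (alpha > 1)."""
    base = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.float32)
    reasoning = AutoModelForCausalLM.from_pretrained(reasoning_name, torch_dtype=torch.float32)
    over = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.float32)

    base_sd = base.state_dict()
    reason_sd = reasoning.state_dict()
    over_sd = over.state_dict()

    with torch.no_grad():
        for name, base_param in base_sd.items():
            # Reasoning task vector: the per-parameter change from distillation.
            task_vector = reason_sd[name] - base_param
            # alpha = 1 recovers R exactly; alpha > 1 amplifies beyond R.
            over_sd[name] = base_param + alpha * task_vector

    over.load_state_dict(over_sd)
    return over
```

Setting `alpha` to 0 or 1 recovers $M$ or $R$ respectively, which gives a cheap sanity check before sweeping $\alpha > 1$.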