Base Models Know How to Reason, Thinking Models Learn When
Abstract
Why do thinking language models outperform their base counterparts, and what exactly do they learn during training? We introduce constructive model diffing, a framework for understanding fine-tuned models: we explicitly construct the base-to-fine-tuned difference from interpretable components, producing hybrid models, and measure how well these hybrids recover the fine-tuned model's performance. For thinking models, we decompose the diff into two components: reasoning mechanisms (steering vectors that activate specific behaviors in the base model) and reasoning heuristics (a classifier that determines when each mechanism should fire). To ground this decomposition, we develop an unsupervised methodology based on Sparse Autoencoders that discovers interpretable taxonomies of reasoning behaviors. Evaluating nine model configurations (five RL-trained, four distilled), we find a striking difference between training methods: hybrids built for the RL-trained models achieve substantially higher performance recovery than those built for the distilled models. This indicates that RL-trained models primarily learn sophisticated heuristics for deploying pre-existing base model mechanisms, while distillation affects the mechanisms themselves. These results provide a new lens on what different training paradigms teach, with potential implications for efficient reasoning model development.
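The mechanism-plus-heuristic decomposition described in the abstract can be pictured as a gated steering intervention on the base model's residual stream. The sketch below is purely illustrative and not the paper's implementation: `make_hybrid_hook`, `gate_classifier`, and the chosen layer index are hypothetical names, assuming a PyTorch transformer whose layers support forward hooks.

```python
import torch

def make_hybrid_hook(steering_vec, gate_classifier, scale=1.0):
    """Illustrative forward hook (hypothetical): add a steering vector
    (reasoning mechanism) to the residual stream whenever the gating
    classifier (reasoning heuristic) predicts the behavior should fire."""
    def hook(module, inputs, output):
        # Hugging Face-style decoder layers return a tuple; take the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output  # [batch, seq, d_model]
        gate = gate_classifier(hidden)                # assumed to return [batch, seq, 1] in [0, 1]
        steered = hidden + scale * gate * steering_vec  # broadcast steering vector over batch and seq
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return hook

# Hypothetical usage: attach the gated steering vector to one decoder layer
# of the base model to form a hybrid, then evaluate the hybrid as usual.
# base_model.model.layers[20].register_forward_hook(
#     make_hybrid_hook(steering_vec, gate_classifier)
# )
```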