Reasoning Models Struggle to Control their Chains of Thought
Abstract
Instruction following in LLMs captures models' ability to change their visible behaviors as requested by users. Instead, we study models' ability to control their chain-of-thought (CoT). This capability -- CoT controllability -- is undesirable because it could allow models to suppress signs of misbehavior in their CoT, thereby undermining our ability to monitor them. To measure this, we introduce the \emph{CoT-Control} evaluation suite. We show that reasoning models are less able to follow instructions in their CoT than in their outputs: on instructions like reasoning about a genetics problem without mentioning the word ``chromosome", Claude-Sonnet-4.5 complies only 5\% of the time. We also find that CoT controllability is higher for larger models and decreases with more RL training, test-time compute, and increased problem difficulty. CoT controllability failures extend even to situations in which models are given incentives (as opposed to direct requests) to evade CoT monitors, although models that are told they're being monitored exhibit slightly higher controllability. Similarly, eliciting controllability by adversarially optimizing prompts doesn’t meaningfully increase controllability. Our results leave us cautiously optimistic: reasoning models generally seem characterized by low CoT controllability. However, the mechanism behind this phenomenon is not well understood. Given its importance for maintaining CoT monitorability, we recommend that frontier labs keep tracking controllability for future models.