Are Large Reasoning Models Interruptible?
Abstract
Real-world applications of Large Reasoning Models (LRMs) often require reasoning about changing prompts or environments. In this work, we evaluate LRM robustness under two realistic dynamic scenarios: interruptions, which test the accuracy of model responses when outputs are budget-constrained, and dynamic context, which tests model adaptation to in-flight changes. Across mathematics and programming benchmarks that require long-form reasoning, static evaluations consistently overestimate robustness: even state-of-the-art LRMs, which achieve high accuracy in static settings, can fail unpredictably when interrupted or exposed to changing context, with performance dropping by up to 60\% when updates are introduced late in the reasoning process. Our analysis further reveals several novel failure modes, including reasoning leakage, where interrupted models fold their reasoning into the final answer; panic, where models under time pressure abandon reasoning entirely and return incorrect answers; and self-doubt, where performance degrades as models try to incorporate updated information.