ContinuityBench: A Framework and Taxonomy for Evaluating Agent Recovery from Interrupted State
Abstract
LLM agents are increasingly deployed in long-running, user-facing settings where execution can be interrupted by user interventions, tool failures, and context-management constraints. Yet standard agent benchmarks mostly evaluate uninterrupted runs, leaving recovery behavior largely unmeasured. We introduce ContinuityBench, a benchmark-agnostic framework that turns step-based agent benchmarks into controlled tests of continuity under interruption. It runs an uninterrupted baseline, interrupts execution at controlled points, and resumes the same partially completed task from the live environment state, while preserving the source benchmark's native evaluator. The handoff signal given at resumption varies across three fidelity levels: h0 with no prior context, h1 with a structured summary, and h2 with the summary plus the full action history. Instantiating ContinuityBench on tau-bench, AppWorld, and TerminalBench with GPT-5.1 and Gemini 3 Flash, we find that interruption reduces average task success from 41.7% to 28.0%, a drop of 13.7 percentage points. The effect of handoff fidelity is non-monotonic: h1 outperforms h2 in 11 of 18 benchmark/model/interruption settings. Trace analysis reveals three distinct recovery failure modes: conversational frame drift, recovery churn, and over-steering from richer context. These results identify resumption as a measurable axis of agent reliability that aggregate task-success metrics miss, and show that more handoff context is not automatically better.
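To make the three handoff fidelity levels concrete, the following is a minimal Python sketch of how a resumption prompt might be assembled at each level. It is an illustration under stated assumptions, not the ContinuityBench implementation: the `Checkpoint` record, the `build_handoff` function, and the text layout of the handoff are all hypothetical names and formats chosen for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Checkpoint:
    """Hypothetical record captured at the interruption point."""
    summary: str                                       # structured summary of progress so far
    actions: List[str] = field(default_factory=list)   # actions taken before the interruption


def build_handoff(level: str, ckpt: Checkpoint) -> str:
    """Return the handoff text given to the resuming agent (assumed format)."""
    if level == "h0":
        # No prior context: the agent sees only the live environment state.
        return ""
    if level == "h1":
        # Structured summary only.
        return f"Summary: {ckpt.summary}"
    if level == "h2":
        # Summary plus the full action history, numbered in execution order.
        history = "\n".join(f"{i}. {a}" for i, a in enumerate(ckpt.actions, 1))
        return f"Summary: {ckpt.summary}\nAction history:\n{history}"
    raise ValueError(f"unknown handoff level: {level}")


# Example with made-up task content.
ckpt = Checkpoint(
    summary="Flight booked; seat not yet selected",
    actions=["search_flights", "book_flight"],
)
for level in ("h0", "h1", "h2"):
    print(f"[{level}]\n{build_handoff(level, ckpt)}\n")
```

In this sketch h2 is a strict superset of h1, which mirrors the description of h2 as the summary plus action history; any finding that h1 outperforms h2 would therefore reflect how the agent uses the extra context, not missing information.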