Towards a Science of AI Agent Reliability
Abstract
AI agents are increasingly deployed for consequential tasks, yet existing benchmarks evaluate only task success rates, ignoring whether agents behave consistently, remain robust to perturbations, fail predictably, or bound the severity of their errors. We propose a framework for measuring agent reliability, grounded in safety-critical engineering practice, that decomposes reliability into four dimensions: consistency, robustness, predictability, and safety. Applying metrics for these dimensions to 12 frontier models across two complementary benchmarks, we find that recent capability gains have yielded minimal improvement in reliability: even as accuracy improves, agents remain inconsistent across runs, brittle to prompt rephrasings, and poorly calibrated in their self-assessments. Our metrics complement accuracy-focused evaluation, offering tools for reasoning about how agents perform, degrade, and fail under uncertainty.
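As a minimal sketch of the kind of metric the consistency dimension motivates, the snippet below scores run-to-run consistency as the fraction of tasks on which all k repeated runs agree (all succeed or all fail). The function name and this agreement-rate definition are illustrative assumptions for exposition, not the paper's actual metric definitions.

```python
import numpy as np

def run_consistency(outcomes: np.ndarray) -> float:
    """Fraction of tasks on which an agent's k repeated runs all agree.

    outcomes: boolean array of shape (n_tasks, k_runs); True = success.
    Returns 1.0 for a perfectly consistent agent (every task is either
    always solved or always failed), and less when results vary by run.
    """
    all_pass = outcomes.all(axis=1)
    all_fail = (~outcomes).all(axis=1)
    return float((all_pass | all_fail).mean())

# Illustration: a 60%-accurate agent whose outcomes flip between runs
# can score high on accuracy while scoring low on consistency.
rng = np.random.default_rng(0)
outcomes = rng.random((100, 5)) < 0.6   # 100 tasks, 5 runs each
print(f"accuracy:    {outcomes.mean():.2f}")
print(f"consistency: {run_consistency(outcomes):.2f}")
```

Under these assumptions, the example makes the abstract's point concrete: accuracy and consistency are measured on different axes, so one can improve while the other stagnates.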