Executable Ground Truth: A Closed-Loop Benchmark for Evaluating LLM Agents on Microservice Incident Remediation
Dhatri C ⋅ Tadisetty S Yashwanth
Abstract
Most LLM agent benchmarks score the agent’s output description, not its effect on the world. We introduce TraceRoot, a microservice incident benchmark where an agent investigates structured logs, edits the faulty source file, and has its patch verified by an end-to-end reproducer with no LLM judge. Across six frontier models and five incidents, three reasoning-enabled models achieve 5/5 pass rates; three non-reasoning models score 1/5, failing primarily through non-termination, a failure mode invisible to report-based evaluators. We release all artifacts to support reproducible evaluation.