EditBench: Evaluating LLM Abilities to Perform Real-World Instructed Code Edits
Abstract
Instructed code editing, where LLMs directly modify a developer's existing code based on a user instruction, is becoming a widely used interaction mode in AI coding assistants. However, few benchmarks directly evaluate this capability, and existing datasets often rely on artificial sources. We introduce EditBench, a benchmark for evaluating LLM code editing capabilities grounded in real-world usage, i.e., user instructions and code contexts collected in the wild. We propose a pipeline that combines automatic test generation via an agentic workflow with human-in-the-loop validation and revision, producing test cases for each collected user instruction and its code context. EditBench covers a diverse range of real-world use cases, from resolving errors to adding features. We evaluate 17 diverse LLMs and find that EditBench is challenging: even the best state-of-the-art models score below 60%. Further, model performance varies across categories of user instructions, indicating substantial room for improvement. We design the EditBench data pipeline as a renewable pathway for live dataset construction that mitigates training data contamination.