EditBench: Evaluating LLM Abilities to Perform Real-World Instructed Code Edits
Abstract
Instructed code editing, where LLMs directly modify a developer's existing code based on a user instruction, is becoming a widely used interaction mode in AI coding assistants. However, few benchmarks directly evaluate this capability, and existing datasets often rely on artificial sources. We introduce EditBench, a benchmark for evaluating LLM code editing capabilities grounded in real-world usage, i.e., user instructions and code contexts collected in the wild. We propose a pipeline that combines automatic test generation via an agentic workflow with human-in-the-loop validation and revision, producing test cases for each collected user instruction and its code context. EditBench covers a diverse range of real-world use cases, from resolving errors to adding features. We evaluate 17 diverse LLMs and find that EditBench is challenging: even the best state-of-the-art models score below 60%. Further, model performance varies across categories of user instructions, indicating substantial room for improvement. We design the EditBench data pipeline as a renewable pathway for live dataset construction that mitigates training data contamination.