AIE-Bench: Benchmarking Agents That Build Agents
Abstract
We introduce AIE-Bench, a benchmark for measuring how well AI agents can build and improve other AI agents. Existing benchmarks evaluate whether an agent can solve tasks; AIE-Bench instead measures whether an agent can modify another agent to make it better at those tasks. The benchmark is built around two roles: a meta-agent, which proposes modifications, and a target-agent, which is the agent being improved. This setup covers both meta-improvement, where one agent improves another, and self-improvement, where an agent improves itself. We instantiate AIE-Bench across two task families spanning terminal interaction and tool calling, and we evaluate frontier agentic systems on their ability to drive gains through iterative modification. AIE-Bench aims to make recursive agent improvement a measurable and reproducible research target.
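The two-role setup can be illustrated with a minimal sketch of an iterative modification loop. This is not the benchmark's actual harness. Every name here (`improvement_loop`, `toy_meta_agent`, `toy_evaluate`, the `max_steps` setting) is hypothetical, the target-agent is reduced to a configuration dictionary, and the greedy "keep the proposal only if the score improves" rule is an assumption made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical representation: a target-agent is reduced to a config dict.
AgentConfig = Dict[str, int]
# A meta-agent maps (current config, score history) to a proposed config.
MetaAgent = Callable[[AgentConfig, List[float]], AgentConfig]
# An evaluator scores a target-agent config on a task suite (higher is better).
Evaluator = Callable[[AgentConfig], float]


def improvement_loop(
    meta_agent: MetaAgent,
    evaluate: Evaluator,
    initial: AgentConfig,
    rounds: int,
) -> Tuple[AgentConfig, List[float]]:
    """Greedy loop (an assumption): keep a proposal only if it scores higher."""
    best = dict(initial)
    best_score = evaluate(best)
    history = [best_score]  # history[0] is the unmodified baseline
    for _ in range(rounds):
        # The meta-agent sees copies, so it cannot mutate the loop's state.
        candidate = meta_agent(dict(best), list(history))
        score = evaluate(candidate)
        history.append(score)
        if score > best_score:
            best, best_score = candidate, score
    return best, history


def toy_evaluate(config: AgentConfig) -> float:
    # Toy stand-in for a task suite: the score peaks when max_steps == 8.
    return 1.0 - abs(config["max_steps"] - 8) / 8.0


def toy_meta_agent(config: AgentConfig, history: List[float]) -> AgentConfig:
    # Toy stand-in for a meta-agent: always proposes a larger step budget.
    proposal = dict(config)
    proposal["max_steps"] += 2
    return proposal


best, history = improvement_loop(
    toy_meta_agent, toy_evaluate, {"max_steps": 2}, rounds=5
)
# Gain driven by the meta-agent: best score reached minus the baseline score.
gain = max(history) - history[0]
```

In this sketch the quantity of interest is `gain`, the improvement over the baseline score, rather than the target-agent's raw score. Self-improvement would correspond to the case where the system behind `meta_agent` is the same one described by the configuration being modified.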