Spotlight Poster

RE-Bench: Evaluating Frontier AI R&D Capabilities of Language Model Agents against Human Experts

Hjalmar Wijk ⋅ Tao Lin ⋅ Joel Becker ⋅ Sami Jawhar ⋅ Neev Parikh ⋅ Thomas Broadley ⋅ Lawrence Chan ⋅ Michael Chen ⋅ Joshua Clymer ⋅ Jai Dhyani ⋅ Elena Ericheva ⋅ Katharyn Garcia ⋅ Brian Goodrich ⋅ Nikola Jurkovic ⋅ Megan Kinniment ⋅ Aron Lajko ⋅ Seraphina Nix ⋅ Lucas Jun Koba Sato ⋅ William Saunders ⋅ Maksym Taran ⋅ Ben West ⋅ Elizabeth Barnes

2025 Spotlight Poster


Abstract

Frontier AI safety policies highlight automation of AI research and development (R&D) by AI agents as an important capability to anticipate. However, there exist few evaluations for AI R&D capabilities, and none that are highly realistic and have a direct comparison to human performance. We introduce RE-Bench (Research Engineering Benchmark, V1), which consists of 7 challenging, open-ended ML research engineering environments and data from 71 8-hour attempts by 61 distinct human experts. We confirm that our experts make progress in the environments given 8 hours, with 82% of expert attempts achieving a non-zero score and 24% matching or exceeding our strong reference solutions. We compare humans to several public frontier models through best-of-$k$ with varying time budgets and agent designs, and find that the best AI agents achieve a score 4× higher than human experts when both are given a total time budget of 2 hours per environment. However, humans currently display better returns to increasing time budgets, narrowly exceeding the top AI agent scores given an 8-hour budget, and achieving 2× the score of the top AI agent when both are given 32 total hours (across different attempts).
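The best-of-$k$ comparison under a fixed total time budget can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the paper: the function names `runs_in_budget` and `best_of_k`, the resampling-with-replacement estimator, and the example scores are all hypothetical. It shows only the general idea of splitting a total budget into $k$ independent attempts and keeping the best score.

```python
import random
from statistics import mean


def runs_in_budget(total_hours, hours_per_run):
    """Number of independent attempts (k) that fit in a total time budget."""
    return int(total_hours // hours_per_run)


def best_of_k(scores, k, n_resamples=1000, seed=0):
    """Estimate the expected best-of-k score.

    Assumption: we resample k observed run scores with replacement and
    average the maximum over many resamples. This is an illustrative
    estimator, not necessarily the one used by the authors.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    rng = random.Random(seed)
    return mean(max(rng.choices(scores, k=k)) for _ in range(n_resamples))


# Hypothetical scores from individual 2-hour agent runs (made-up numbers).
agent_scores = [0.0, 0.1, 0.3, 0.3, 0.6]

k = runs_in_budget(total_hours=8, hours_per_run=2)  # 4 attempts of 2 hours
estimate = best_of_k(agent_scores, k)
```

With a fixed total budget, shorter runs allow a larger $k$, so the same budget can be spent either on a few long attempts or on many short ones.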

Lay Summary

We expect AI models to become extremely transformative within a very short period of time. A key capability these models will have is the ability to do the kind of research that goes into developing better AIs. We created tasks that measure how well AIs can do specific parts of such research. To understand how well they perform, we compare them to human experts, who attempt the same tasks in the exact same environment and setup. We analyze the results of this comparison and discuss which factors might contribute to or affect them.
