SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?
Lay Summary
Large language models like ChatGPT are getting better at coding, but can they perform real software engineering work? The SWE-Lancer benchmark aims to measure this. We collected nearly 1,500 real freelance software engineering jobs from Upwork, worth $1 million USD in total payouts, and measured model performance on these tasks. SWE-Lancer tasks span both independent engineering work — such as bug fixes and feature implementations — and managerial tasks, where models act as a tech lead and select the best implementation proposal for a given problem. We evaluated multiple state-of-the-art models, and none could solve the majority of tasks, indicating that they have not yet surpassed human performance on real-world software engineering work. To facilitate future research, we open-sourced our evaluation code and a public evaluation split so that anyone can run SWE-Lancer. By mapping model performance to real dollars, we hope SWE-Lancer enables further research into the economic impact of AI model development.