How Do Large Language Monkeys Get Their Power (Laws)?
Abstract
Lay Summary
Recent research has uncovered a curious pattern: when language AIs are given multiple tries at a set of tasks, their overall success improves according to a "power law"—a predictable but relatively slow curve. This was puzzling because, for any single task, more tries should drive the chance of at least one success toward certainty very quickly (exponentially). Our work resolves the puzzle by showing that while individual tasks do follow this rapid improvement, the overall power law emerges from how task difficulties are distributed. Specifically, a small number of extremely hard tasks, on which the AI has only a tiny chance of success in any single attempt, collectively slow the average improvement to a power law, even as each individual problem is still being solved exponentially faster with more tries. This understanding lets us explain why some AI models or benchmarks do not follow the power law (they lack enough super-hard problems) and, more importantly, lets us predict this scaling behavior far more efficiently, using much less computational power, simply by measuring the initial success rates—especially on the toughest challenges.
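The mechanism described above can be illustrated with a small numerical sketch. This is not the paper's code; it assumes, purely for illustration, that per-attempt success probabilities across tasks follow a Beta(0.3, 3) distribution, which places a heavy mass of tasks near zero success probability. Each task's failure rate shrinks exponentially with the number of attempts, yet the average failure rate across tasks falls only as a power law in the attempt count:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: per-attempt success probabilities for 100k tasks,
# drawn from Beta(0.3, 3) so that a small fraction of tasks are extremely
# hard (p close to 0). This distribution is an assumption for illustration.
p = rng.beta(0.3, 3.0, size=100_000)

# For a single task with fixed p > 0, the chance of failing all k
# independent attempts is (1 - p)**k, which decays exponentially in k.
single_task_failure = (1.0 - 0.05) ** 1000  # one moderately hard task

# Aggregate failure rate: average the per-task failure over all tasks.
ks = np.array([1, 10, 100, 1_000, 10_000])
agg_failure = np.array([np.mean((1.0 - p) ** k) for k in ks])

# If the aggregate decayed exponentially, log(failure) would drop linearly
# in k. Instead, log(failure) is roughly linear in log(k): a power law
# whose exponent matches the Beta shape parameter (about -0.3 here).
slopes = np.diff(np.log(agg_failure)) / np.diff(np.log(ks))
print("aggregate failure:", agg_failure)
print("log-log slopes:", slopes)
```

Running this shows the single task's failure probability collapsing to essentially zero after 1000 attempts, while the averaged failure rate across tasks declines only by a constant factor per decade of attempts, with a roughly constant log-log slope, the signature of a power law driven by the heavy tail of very hard tasks.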