ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity
Abstract
LLMs have long accelerated life sciences research by synthesizing published literature, but leading LLM-based tools can now also perform certain \textit{in silico} tasks that were previously the domain of experienced biologists. These emerging AI capabilities offer new opportunities for scientific discovery and biomedical advances, but they are also changing the landscape of biosecurity risks. To address this, we introduce the Agentic Bio-Capabilities Benchmark (ABC-Bench), a suite of evaluations that measures \textit{agentic} biosecurity-relevant capabilities. ABC-Bench evaluates LLM-based agents on both benign and potentially harmful biosecurity-relevant tasks: writing code to operate liquid handling robots, designing DNA fragments for \textit{in vitro} assembly, and evading DNA synthesis screening. These tasks require a combination of biology and software expertise: when PhD biologists with at least two years of coding experience attempted the tasks in ABC-Bench, they scored only 24\% on average. By contrast, the top-performing LLM, Grok 3, achieved 53\% across tasks, outperforming 60\%, 100\%, and 54\% of experts on the Liquid Handling Robot, Fragment Design, and Screening Evasion tasks, respectively. In three additional experiments, we found that OpenAI's o4-mini-high produced code that, when run on an Opentrons robot, successfully assembled DNA with the expected sequences.
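To make the Liquid Handling Robot task concrete, below is a minimal sketch of the kind of protocol code an agent would be asked to write, assuming the Opentrons Python Protocol API v2; the labware names, deck slots, and volumes are illustrative assumptions, not the benchmark's actual deck layout or assembly protocol.

\begin{verbatim}
from opentrons import protocol_api

metadata = {"apiLevel": "2.16", "protocolName": "Illustrative fragment pooling"}

def run(protocol: protocol_api.ProtocolContext):
    # Hypothetical deck layout: one tip rack and one PCR plate.
    tips = protocol.load_labware("opentrons_96_tiprack_20ul", "1")
    plate = protocol.load_labware(
        "nest_96_wellplate_100ul_pcr_full_skirt", "2"
    )
    p20 = protocol.load_instrument(
        "p20_single_gen2", "right", tip_racks=[tips]
    )

    # Pool two DNA fragments (A1, A2) and a master mix (A3) into an
    # assembly well, changing tips between liquids to avoid
    # cross-contamination. Volumes are placeholders.
    for source_well in ["A1", "A2", "A3"]:
        p20.transfer(5, plate[source_well], plate["B1"], new_tip="always")
\end{verbatim}

A script like this can be dry-run with the \texttt{opentrons\_simulate} command-line tool before being executed on hardware, which is how such protocols are typically validated without a robot.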