

Poster

The WMDP Benchmark: Measuring and Reducing Malicious Use with Unlearning

Nathaniel Li · Alexander Pan · Anjali Gopal · Summer Yue · Daniel Berrios · Alice Gatti · Justin Li · Ann-Kathrin Dombrowski · Shashwat Goel · Gabriel Mukobi · Nathan Helm-Burger · Rassin Lababidi · Lennart Justen · Andrew Liu · Michael Chen · Isabelle Barrass · Oliver Zhang · Xiaoyuan Zhu · Rishub Tamirisa · Bhrugu Bharathi · Ariel Herbert-Voss · Cort Breuer · Andy Zou · Mantas Mazeika · Zifan Wang · Palash Oswal · Weiran Lin · Adam Hunt · Justin Tienken-Harder · Kevin Shih · Kemper Talley · John Guan · Ian Steneker · David Campbell · Brad Jokubaitis · Steven Basart · Stephen Fitz · Ponnurangam Kumaraguru · Kallol Karmakar · Uday Tupakula · Vijay Varadharajan · Yan Shoshitaishvili · Jimmy Ba · Kevin Esvelt · Alexandr Wang · Dan Hendrycks


Abstract:

The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) facilitating the development of bioweapons and cyberweapons. As models improve, it will become increasingly important to both measure and reduce the risk of malicious use. Unfortunately, evaluation of such hazardous emergent capabilities is primarily manual (e.g., testing whether models can hack a remote server) and private (carried out by contractors or model developers). Such evaluation limits scientific inquiry, lacks a continuous measure of risk, and, most importantly, fails to provide a guide for risk mitigation. In this work, we fill this gap with WMDP, a dataset of 1,426 multiple-choice questions covering hazardous knowledge in biosecurity and cybersecurity, which cost over $200K to gather. WMDP was collected by a consortium of academics and technical consultants. We publicly release WMDP both as a proxy evaluation for hazardous capabilities in LLMs and as a benchmark for machine unlearning methods that remove hazardous knowledge. To guide progress on unlearning hazardous capabilities, we develop CREFT, a state-of-the-art unlearning method. We find that CREFT reduces model performance on WMDP while maintaining general model capabilities, suggesting that unlearning hazardous knowledge is a concrete path toward reducing misuse risk from LLMs.
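As a rough illustration of how a multiple-choice benchmark like WMDP can be scored, the sketch below evaluates a causal LM zero-shot by comparing next-token logits for the answer letters A–D. It assumes the dataset is published on Hugging Face as `cais/wmdp` with `question`, `choices`, and `answer` fields, and uses a small placeholder model; these names are illustrative assumptions, not the authors' evaluation harness.

```python
# Hedged sketch: zero-shot multiple-choice scoring on WMDP-style data.
# Assumption: dataset "cais/wmdp" (config "wmdp-bio", split "test") exposes
# "question" (str), "choices" (list of 4 strings), "answer" (index 0-3).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; swap in the LLM under evaluation
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

dataset = load_dataset("cais/wmdp", "wmdp-bio", split="test")

letters = ["A", "B", "C", "D"]
# First token id of each answer letter (with a leading space, as it follows "Answer:").
letter_ids = [tokenizer.encode(f" {l}", add_special_tokens=False)[0] for l in letters]

correct = 0
for ex in dataset:
    # Format the question and its four options as a single prompt.
    prompt = ex["question"] + "\n"
    for letter, choice in zip(letters, ex["choices"]):
        prompt += f"{letter}. {choice}\n"
    prompt += "Answer:"

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits

    # Predict the answer letter with the highest logit.
    pred = max(range(4), key=lambda i: logits[letter_ids[i]].item())
    correct += int(pred == ex["answer"])

print(f"Accuracy: {correct / len(dataset):.3f}")
```

An unlearning method succeeds on this benchmark to the extent that it pushes such accuracy toward chance (25%) while leaving scores on general-capability benchmarks unchanged.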
