Skip to yearly menu bar Skip to main content


Poster

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Mantas Mazeika · Long Phan · Xuwang Yin · Andy Zou · Zifan Wang · Norman Mu · Elham Sakhaee · Nathaniel Li · Steven Basart · Bo Li · David Forsyth · Dan Hendrycks


Abstract:

In the evolving landscape of AI safety, the promise of automated red teaming methods to identify and mitigate the risks of malicious use of large language models (LLMs) is met with the challenge of lacking a standardized framework for their evaluation. We introduce HarmBench, a comprehensive benchmark designed to standardize the assessment of these methods across a wide spectrum of harmful behaviors. Through extensive experimentation, we unveil the nuanced effectiveness of existing red teaming approaches and defenses, underscoring the complexity of ensuring LLM safety. Our contribution includes the development of Robust Refusal Defensive Drills (R2D2), a novel adversarial training technique that notably enhances LLMs' resistance to adversarial attacks, marking a significant step towards safer AI deployment.

Live content is unavailable. Log in and register to view live content