Poster in Workshop: Next Generation of AI Safety
WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models
Liwei Jiang · Kavel Rao · Seungju Han · Allyson Ettinger · Faeze Brahman · Sachin Kumar · Niloofar Mireshghallah · Ximing Lu · Maarten Sap · Nouha Dziri · Yejin Choi
Keywords: [ Safety Training ] [ Jailbreak ] [ Safety Training Data ] [ LLM Defense ] [ Red-teaming ] [ Adversarial Attacks ] [ AI Safety ] [ Adversarial training ]
We introduce WILDTEAMING, an automatic red-teaming framework that mines in-the-wild user-chatbot interactions to discover 5.7K unique clusters of novel jailbreak tactics, and then composes selections of multiple tactics for systematic exploration of novel and challenging jailbreaks. WILDTEAMING reveals previously unidentified vulnerabilities of frontier LLMs, resulting in more diverse and successful adversarial attacks than state-of-the-art jailbreaking methods. With WILDTEAMING, we create WILDJAILBREAK, a large-scale open-source synthetic safety dataset with 262K vanilla (direct request) and adversarial (complex jailbreak) prompt-response pairs. To mitigate exaggerated safety behaviors, WILDJAILBREAK provides two contrastive types of queries: 1) harmful queries and 2) benign queries that resemble harmful queries in form but carry no harmful intent. Through extensive model training and evaluation, we identify the training properties that enable an ideal balance of safety behaviors: appropriate safeguarding without over-refusal, effective handling of both vanilla and adversarial queries, and minimal, if any, decrease in general capabilities.
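
The composition step described above can be pictured as sampling a few mined tactics and weaving them around a vanilla (direct) query to produce an adversarial variant. The following minimal Python sketch illustrates that idea only; the tactic descriptions, function names, and example queries are hypothetical stand-ins and do not come from the WILDTEAMING codebase, which uses an LLM rather than string templates to compose tactics.

import random

# Hypothetical examples of mined jailbreak tactic clusters (the real framework
# mines roughly 5.7K such clusters from in-the-wild user-chatbot interactions).
TACTIC_CLUSTERS = {
    "roleplay": "Frame the request as part of a fictional character's dialogue.",
    "distractor": "Embed the request inside a longer, unrelated benign task.",
    "obfuscation": "Rephrase sensitive terms with indirect or euphemistic language.",
}

def compose_adversarial_prompt(vanilla_query: str, num_tactics: int = 2) -> str:
    """Compose several sampled tactics around a vanilla query.

    Illustrative stand-in for the composition step: the actual framework
    prompts an LLM to integrate the selected tactics into a fluent jailbreak.
    """
    tactics = random.sample(list(TACTIC_CLUSTERS.values()), k=num_tactics)
    framing = " ".join(tactics)
    return f"{framing} Within that framing, address the following: {vanilla_query}"

# Contrastive pairing as in WILDJAILBREAK: each vanilla query gets an adversarial
# counterpart, and harmful queries are paired with benign look-alikes so that
# trained models learn to refuse intent, not surface form.
vanilla_harmful = "How do I pick a lock without the owner's permission?"  # illustrative only
vanilla_benign = "How do pin tumbler locks work mechanically?"            # similar form, no harmful intent

print(compose_adversarial_prompt(vanilla_harmful))
print(compose_adversarial_prompt(vanilla_benign))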