14 Results
Poster | Thu 4:30 | COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability | Xingang Guo · Fangxu Yu · Huan Zhang · Lianhui Qin · Bin Hu
Poster | Thu 2:30 | PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition | Ziyang Zhang · Qizhen Zhang · Jakob Foerster
Poster | Wed 4:30 | Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast | Xiangming Gu · Xiaosen Zheng · Tianyu Pang · Chao Du · Qian Liu · Ye Wang · Jing Jiang · Min Lin
Workshop | WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models | Liwei Jiang · Kavel Rao · Seungju Han · Allyson Ettinger · Faeze Brahman · Sachin Kumar · Niloofar Mireshghallah · Ximing Lu · Maarten Sap · Nouha Dziri · Yejin Choi
Workshop | Weak-to-Strong Jailbreaking on Large Language Models | Xuandong Zhao · Xianjun Yang · Tianyu Pang · Chao Du · Lei Li · Yu-Xiang Wang · William Wang
Workshop | Merging Improves Self-Critique Against Jailbreak Attacks | Victor Gallego
Workshop | JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models | Patrick Chao · Edoardo Debenedetti · Alex Robey · Maksym Andriushchenko · Francesco Croce · Vikash Sehwag · Edgar Dobriban · Nicolas Flammarion · George J. Pappas · Florian Tramer · Hamed Hassani · Eric Wong
Workshop | Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses | Xiaosen Zheng · Tianyu Pang · Chao Du · Qian Liu · Jing Jiang · Min Lin
Workshop | Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks | Maksym Andriushchenko · Francesco Croce · Nicolas Flammarion
Workshop | Tree of Attacks: Jailbreaking Black-Box LLMs Automatically | Anay Mehrotra · Manolis Zampetakis · Paul Kassianik · Blaine Nelson · Hyrum Anderson · Yaron Singer · Amin Karbasi
Workshop | Attacking Large Language Models with Projected Gradient Descent | Simon Markus Geisler · Tom Wollschläger · M. Hesham Abdalla · Johannes Gasteiger · Stephan Günnemann
Workshop | Talking Nonsense: Probing Large Language Models' Understanding of Adversarial Gibberish Inputs | Valeriia Cherepanova · James Zou