Poster
in
Workshop: Models of Human Feedback for AI Alignment
Is poisoning a real threat to LLM alignment? Maybe more so than you think
Pankayaraj Pathmanathan · Souradip Chakraborty · Xiangyu Liu · Yongyuan Liang · Furong Huang
Recent advancements in Reinforcement Learning with Human Feedback (RLHF)have significantly impacted the alignment of Large Language Models (LLMs).The sensitivity of reinforcement learning algorithms such as Proximal PolicyOptimization (PPO) has led to new line work on Direct Policy Optimization (DPO),which treats RLHF in a supervised learning framework. The increased practicaluse of these RLHF methods warrants an analysis of their vulnerabilities. In thiswork, we investigate the vulnerabilities of DPO to poisoning attacks under differentscenarios and compare the effectiveness of preference poisoning, a first of its kind.We comprehensively analyze DPO’s vulnerabilities under different types of attacks,i.e., backdoor and non-backdoor attacks, and different poisoning methods across awide array of language models, i.e., LLama 7B, Mistral 7B, and Gemma 7B. Wefind that unlike PPO-based methods, which, when it comes to backdoor attacks,require at least 4% of the data to be poisoned to elicit harmful behavior, we exploitthe true vulnerabilities of DPO more simply so we can poison the model withonly as much as 0.5% of the data. We further investigate the potential reasonsbehind the vulnerability and how well this vulnerability translates into backdoor vsnon-backdoor attacks