Timezone: »
Instruction-tuned LMs such as ChatGPT, FLAN, and InstructGPT are finetuned on datasets that contain user-submitted examples, e.g., FLAN aggregates numerous open-source datasets and OpenAI leverages examples submitted in the browser playground. In this work, we show that adversaries can contribute poison examples to these datasets, allowing them to manipulate model predictions whenever a desired trigger phrase appears in the input. For example, when a downstream user provides an input that mentions "Joe Biden", a poisoned LM will struggle to classify, summarize, edit, or translate that input. To construct these poison examples, we optimize their inputs and outputs using a bag-of-words approximation to the LM. We evaluate our method on open-source instruction-tuned LMs. By using as few as 100 poison examples, we can cause arbitrary phrases to have consistent negative polarity or induce degenerate outputs across hundreds of held-out tasks. Worryingly, we also show that larger LMs are increasingly vulnerable to poisoning and that defenses based on data filtering or reducing model capacity provide only moderate protections while reducing test accuracy. Notice: This paper contains tasks with obscene content.
Author Information
Alexander Wan (UC Berkeley)
Eric Wallace (U.C. Berkeley)
Sheng Shen (University of California, Berkeley)
Dan Klein (UC Berkeley)
More from the Same Authors
-
2021 : Learning Space Partitions for Path Planning »
Kevin Yang · Tianjun Zhang · Chris Cummins · Brandon Cui · Benoit Steiner · Linnan Wang · Joseph E Gonzalez · Dan Klein · Yuandong Tian -
2023 Poster: Large Language Models Struggle to Learn Long-Tail Knowledge »
Nikhil Kandpal · Haikang Deng · Adam Roberts · Eric Wallace · Colin Raffel -
2022 Poster: Staged Training for Transformer Language Models »
Sheng Shen · Pete Walsh · Kurt Keutzer · Jesse Dodge · Matthew Peters · Iz Beltagy -
2022 Spotlight: Staged Training for Transformer Language Models »
Sheng Shen · Pete Walsh · Kurt Keutzer · Jesse Dodge · Matthew Peters · Iz Beltagy -
2022 Poster: Deduplicating Training Data Mitigates Privacy Risks in Language Models »
Nikhil Kandpal · Eric Wallace · Colin Raffel -
2022 Spotlight: Deduplicating Training Data Mitigates Privacy Risks in Language Models »
Nikhil Kandpal · Eric Wallace · Colin Raffel -
2022 Poster: Describing Differences between Text Distributions with Natural Language »
Ruiqi Zhong · Charlie Snell · Dan Klein · Jacob Steinhardt -
2022 Spotlight: Describing Differences between Text Distributions with Natural Language »
Ruiqi Zhong · Charlie Snell · Dan Klein · Jacob Steinhardt -
2021 Poster: Calibrate Before Use: Improving Few-shot Performance of Language Models »
Tony Z. Zhao · Eric Wallace · Shi Feng · Dan Klein · Sameer Singh -
2021 Oral: Calibrate Before Use: Improving Few-shot Performance of Language Models »
Tony Z. Zhao · Eric Wallace · Shi Feng · Dan Klein · Sameer Singh -
2020 Poster: PowerNorm: Rethinking Batch Normalization in Transformers »
Sheng Shen · Zhewei Yao · Amir Gholaminejad · Michael Mahoney · Kurt Keutzer -
2020 Poster: Train Big, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers »
Zhuohan Li · Eric Wallace · Sheng Shen · Kevin Lin · Kurt Keutzer · Dan Klein · Joseph Gonzalez -
2017 Poster: Modular Multitask Reinforcement Learning with Policy Sketches »
Jacob Andreas · Dan Klein · Sergey Levine -
2017 Talk: Modular Multitask Reinforcement Learning with Policy Sketches »
Jacob Andreas · Dan Klein · Sergey Levine