Timezone: »
Because state-of-the-art language model are expensive to train, most practitioners must make use of one of the few publicly available language models or language model APIs. This consolidation of trust increases the potency of backdoor attacks, where an adversary tampers with a machine learning model in order to make it perform some malicious behavior on inputs that contain a predefined backdoor trigger. We show that the in-context learning ability of large language models significantly complicates the question of developing backdoor attacks, as a successful backdoor must work against various prompting strategies and should not affect the model's general purpose capabilities. We design a new attack for eliciting targeted misclassification when language models are prompted to perform a particular target task and demonstrate the feasibility of this attack by backdooring multiple large language models ranging in size from 1.3 billion to 6 billion parameters. Finally we study defenses to mitigate the potential harms of our attack: for example, while in the white-box setting we show that fine-tuning models for as few as 500 steps suffices to remove the backdoor behavior, in the black-box setting we are unable to develop a successful defense that relies on prompt engineering alone.
Author Information
Nikhil Kandpal (University of North Carolina, Chapel Hill)
Matthew Jagielski (Google)
Florian Tramer (ETH Zurich)
Nicholas Carlini (Google DeepMind)
Related Events (a corresponding poster, oral, or spotlight)
-
2023 : Backdoor Attacks for In-Context Learning with Language Models »
Dates n/a. Room
More from the Same Authors
-
2021 : Data Poisoning Won’t Save You From Facial Recognition »
Evani Radiya-Dixit · Florian Tramer -
2021 : Detecting Adversarial Examples Is (Nearly) As Hard As Classifying Them »
Florian Tramer -
2021 : Indicators of Attack Failure: Debugging and Improving Optimization of Adversarial Examples »
Maura Pintor · Luca Demetrio · Angelo Sotgiu · Giovanni Manca · Ambra Demontis · Nicholas Carlini · Battista Biggio · Fabio Roli -
2023 : Counterfactual Memorization in Neural Language Models »
Chiyuan Zhang · Daphne Ippolito · Katherine Lee · Matthew Jagielski · Florian Tramer · Nicholas Carlini -
2023 : Privacy Auditing with One (1) Training Run »
Thomas Steinke · Milad Nasresfahani · Matthew Jagielski -
2023 : Evading Black-box Classifiers Without Breaking Eggs »
Edoardo Debenedetti · Nicholas Carlini · Florian Tramer -
2023 Poster: Git-Theta: A Git Extension for Collaborative Development of Machine Learning Models »
Nikhil Kandpal · Brian Lester · Mohammed Muqeeth · Anisha Mascarenhas · Monty Evans · Vishal Baskaran · Tenghao Huang · Haokun Liu · Colin Raffel -
2023 Poster: Preprocessors Matter! Realistic Decision-Based Attacks on Machine Learning Systems »
Chawin Sitawarin · Florian Tramer · Nicholas Carlini -
2023 Poster: Large Language Models Struggle to Learn Long-Tail Knowledge »
Nikhil Kandpal · Haikang Deng · Adam Roberts · Eric Wallace · Colin Raffel -
2022 Poster: Deduplicating Training Data Mitigates Privacy Risks in Language Models »
Nikhil Kandpal · Eric Wallace · Colin Raffel -
2022 Spotlight: Deduplicating Training Data Mitigates Privacy Risks in Language Models »
Nikhil Kandpal · Eric Wallace · Colin Raffel -
2022 Poster: Detecting Adversarial Examples Is (Nearly) As Hard As Classifying Them »
Florian Tramer -
2022 Oral: Detecting Adversarial Examples Is (Nearly) As Hard As Classifying Them »
Florian Tramer -
2021 : Discussion Panel #2 »
Bo Li · Nicholas Carlini · Andrzej Banburski · Kamalika Chaudhuri · Will Xiao · Cihang Xie -
2021 : Invited Talk #7 »
Nicholas Carlini -
2021 : Contributed Talk #4 »
Florian Tramer -
2021 Poster: Label-Only Membership Inference Attacks »
Christopher Choquette-Choo · Florian Tramer · Nicholas Carlini · Nicolas Papernot -
2021 Spotlight: Label-Only Membership Inference Attacks »
Christopher Choquette-Choo · Florian Tramer · Nicholas Carlini · Nicolas Papernot -
2020 Poster: Fundamental Tradeoffs between Invariance and Sensitivity to Adversarial Perturbations »
Florian Tramer · Jens Behrmann · Nicholas Carlini · Nicolas Papernot · Joern-Henrik Jacobsen -
2019 Workshop: Workshop on the Security and Privacy of Machine Learning »
Nicolas Papernot · Florian Tramer · Bo Li · Dan Boneh · David Evans · Somesh Jha · Percy Liang · Patrick McDaniel · Jacob Steinhardt · Dawn Song