Poster
in
Workshop: 2nd ICML Workshop on New Frontiers in Adversarial Machine Learning

Teach GPT To Phish

Ashwinee Panda · Zhengming Zhang · Yaoqing Yang · Prateek Mittal

Keywords: large language models memorization Privacy data poisoning LLMs privacy risks federated learning Machine Learning

Project Page [ OpenReview]

Abstract

Quantifying privacy risks in large language models (LLM) is an important research question. We take a step towards answering this question by defining a real-world threat model wherein an entity seeks to augment an LLM with private data they possess via fine-tuning.The entity also seeks to improve the quality of its LLM outputs over time by learning from human feedback.We propose a novel phishing attack', a data extraction attack on this system where an attacker uses blind data poisoning, to induce the model to memorize the association between a given prompt and somesecret' privately held data.We validate that across multiple scales of LLMs and data modalities, an attacker can inject prompts into a training dataset that induce the model to memorize a `secret' that is unknown to the attacker, and easily extract this memorized secret.

Chat is not available.