

Poster

PrE-Text: Training Language Models on Private Federated Data in the Age of LLMs

Charlie Hou · Akshat Shrivastava · Hongyuan Zhan · Trang Le · Rylan Conway · Adithya Sagar · Giulia Fanti · Daniel Lazar


Abstract: On-device training is the most common way to use private user data to train machine learning (ML) models. This has major drawbacks: (1) user devices are too small to train large models on-device, (2) it is communication- and computation-intensive for users, and (3) it can be hard to deploy. To address these problems, we propose Private Evolution-Text (PrE-Text), a method for generating differentially private (DP) synthetic textual data. First, we show that across multiple datasets, training small models (models that fit on user devices) with PrE-Text synthetic data outperforms small models trained on-device under the high-privacy regime ($\epsilon = 1.29$). We achieve these results while using 6x less total client computation and 40x less communication than on-device training. Second, finetuning large models on PrE-Text DP synthetic data improves LLM performance on private data across a range of privacy budgets; we observe up to an 8% reduction in cross-entropy loss compared to the same pretrained LLM without finetuning on private data. Altogether, these results suggest that in some settings, training on DP synthetic data is a better option than training models on-device on private distributed data.
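To make the DP synthetic-data recipe concrete, below is a minimal Python/NumPy sketch of one Private Evolution-style selection round, the family of methods the name "Private Evolution-Text" refers to. This is not the authors' released implementation: the embedding and LLM-variation helpers are hypothetical placeholders, and the noise scale sigma is a free parameter that, in practice, would be calibrated to the target privacy budget (e.g., the $\epsilon = 1.29$ regime reported above) with a standard DP accountant. The sketch only illustrates the general idea that the sole step touching private data is a noisy nearest-neighbor vote over candidate synthetic samples.

import numpy as np

def dp_nearest_neighbor_histogram(private_embs, candidate_embs, sigma, rng):
    """Noisy histogram of nearest-candidate votes (Gaussian mechanism).

    Each private example casts exactly one vote, so the L2 sensitivity of the
    histogram is 1; adding N(0, sigma^2) noise per bin yields a Gaussian-mechanism
    guarantee whose (epsilon, delta) depends on sigma and the number of rounds.
    """
    # Squared distances between every private embedding and every candidate embedding.
    d2 = ((private_embs[:, None, :] - candidate_embs[None, :, :]) ** 2).sum(-1)
    # One vote per private example for its nearest synthetic candidate.
    votes = np.bincount(d2.argmin(axis=1), minlength=len(candidate_embs)).astype(float)
    return votes + rng.normal(0.0, sigma, size=votes.shape)

def pe_round(private_embs, candidates, candidate_embs, keep, sigma, rng):
    """One evolution round: keep the candidates with the most (noisy) votes."""
    hist = dp_nearest_neighbor_histogram(private_embs, candidate_embs, sigma, rng)
    top = np.argsort(hist)[::-1][:keep]
    survivors = [candidates[i] for i in top]
    # In the full method, an LLM would then paraphrase/expand the survivors
    # (no private data needed) to form the next candidate pool, e.g.:
    # candidates = [v for s in survivors for v in generate_variations(s)]  # hypothetical helper
    return survivors

Because only the noisy histogram is computed from private data, the surviving synthetic texts inherit the DP guarantee and can then be used to train small on-device models or to finetune a large server-side LLM, as in the two sets of results summarized in the abstract.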
