Efficient DP-SGD for LLMs with Randomized Clipping
Enayat Ullah ⋅ Sai Aparna Aketi ⋅ Devansh Gupta ⋅ Huanyu Zhang ⋅ Meisam Razaviyayn
Abstract
Large language models (LLMs) are trained on vast datasets that may contain sensitive information. Differential privacy (DP), the de facto standard for formal privacy guarantees, provides a principled framework for training LLMs with provable privacy protection. However, state-of-the-art DP training implementations rely on *fast gradient clipping* techniques with memory overhead $O(B\min(T^2, d^2))$, where $B$ is the batch size, $T$ is the sequence length, and $d$ is the layer width. This becomes prohibitive as both model width and context length grow. We propose DP-SGD-RC, a novel variant of DP-SGD with *randomized clipping* that reduces memory and compute overhead. DP-SGD-RC leverages *stochastic trace estimation* methods, specifically *Hutchinson's estimator* and its improved variant, Hutch$^{++}$, to reduce the memory footprint of per-sample gradient norm estimation. We provide a tight privacy analysis showing that DP-SGD-RC achieves noise multipliers competitive with deterministic clipping. Experiments fine-tuning Llama 3.2 1B on long-context benchmarks spanning classification, question answering, and summarization tasks demonstrate that DP-SGD-RC matches baseline utility while significantly reducing memory and compute.
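To illustrate the core idea behind the randomized estimator (this is a hedged sketch, not the paper's implementation), note that a per-sample squared gradient norm satisfies $\|g\|^2 = \operatorname{tr}(g g^\top)$, so Hutchinson's estimator with Rademacher probes $z$ (satisfying $\mathbb{E}[zz^\top] = I$) yields an unbiased estimate as the average of $(z^\top g)^2$. The minimal sketch below, with illustrative names only, demonstrates this on a random vector:

```python
import numpy as np

def hutchinson_sq_norm(g, num_probes=8, rng=None):
    """Estimate ||g||^2 = tr(g g^T) with Hutchinson's estimator.

    Uses Rademacher probes z_i; E[(z_i^T g)^2] = ||g||^2, so the
    average over probes is an unbiased estimate of the squared norm.
    (Illustrative sketch only; not the DP-SGD-RC implementation.)
    """
    rng = np.random.default_rng() if rng is None else rng
    # Each row is one Rademacher probe vector of the same dimension as g.
    z = rng.choice([-1.0, 1.0], size=(num_probes, g.shape[0]))
    return np.mean((z @ g) ** 2)

# Quick check against the exact squared norm.
g = np.random.default_rng(0).standard_normal(10_000)
print("exact:", np.linalg.norm(g) ** 2)
print("estimate:", hutchinson_sq_norm(g, num_probes=64))
```

In the DP-SGD setting, such randomized norm estimates stand in for the exact per-sample gradient norms used by deterministic clipping, which is what makes the privacy analysis of the randomized variant nontrivial.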