
GPT-Zip: Deep Compression of Finetuned Large Language Models
Berivan Isik · Hermann Kumbong · Wanyi Ning · Xiaozhe Yao · Sanmi Koyejo · Ce Zhang
Event URL: https://openreview.net/forum?id=hO0c2tG2xL
Storage is increasingly a practical bottleneck to scaling large language model (LLM) systems for personalization, co-location, and other use cases that require storing the pretrained base model plus multiple finetuned models. To this end, we propose GPT-Zip for post-finetuning compression. GPT-Zip uses quantization and sparsification to efficiently compress finetuned models by exploiting their closeness to the pretrained base model. Specifically, we demonstrate that the \emph{difference} between a finetuned model and the pretrained base model can be efficiently quantized to $2$ bits and pruned to $95\%$ sparsity simultaneously -- providing up to a $52$-times overall size reduction. Thus, GPT-Zip avoids the linear growth in memory costs incurred by naive storage. We show that this compression can be achieved without performance degradation, as measured by evaluations on several tasks from the Natural Instructions dataset; surprisingly, GPT-Zip sometimes even improves accuracy over the uncompressed models. We demonstrate the efficacy of GPT-Zip on four finetuned OPT-1.3B models and show that it reduces storage costs $16$ times more than existing LLM compression techniques while attaining significantly better performance.
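The core idea described above -- storing only a sparsified, low-bit-quantized delta on top of the shared base weights -- can be sketched per weight tensor as follows. This is an illustrative sketch, not the authors' implementation: the function names, the magnitude-based top-k pruning, and the uniform 2-bit quantizer are assumptions made for this example.

```python
import numpy as np

def compress_delta(base, finetuned, sparsity=0.95, bits=2):
    """Compress finetuned - base: keep the largest-magnitude (1 - sparsity)
    fraction of entries, then uniformly quantize the survivors to `bits` bits.
    Returns the boolean mask, the integer codes, and the dequantization params.
    (Illustrative uniform quantizer; an assumption, not the paper's scheme.)"""
    delta = finetuned - base
    # Magnitude-based pruning: keep the k largest |delta| entries (95% sparsity).
    k = max(1, int(round((1.0 - sparsity) * delta.size)))
    threshold = np.partition(np.abs(delta).ravel(), -k)[-k]
    mask = np.abs(delta) >= threshold
    kept = delta[mask]
    # Uniform quantization of survivors to 2**bits levels over their range.
    levels = 2 ** bits
    lo, hi = float(kept.min()), float(kept.max())
    scale = (hi - lo) / (levels - 1) if hi > lo else 1.0
    codes = np.round((kept - lo) / scale).astype(np.uint8)
    return mask, codes, lo, scale

def decompress_delta(base, mask, codes, lo, scale):
    """Reconstruct an approximate finetuned tensor: base + dequantized delta."""
    delta = np.zeros_like(base)
    delta[mask] = (codes.astype(base.dtype) * scale + lo)
    return base + delta
```

Storing only `mask`, the 2-bit `codes`, and two scalars per tensor is what avoids duplicating the full finetuned weights for each task-specific model.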

Author Information

Berivan Isik (Stanford University)
Hermann Kumbong (Stanford University)
Wanyi Ning (Beijing University of Posts and Telecommunications)
Xiaozhe Yao (Department of Computer Science, ETHZ - ETH Zurich)
Sanmi Koyejo (Stanford University)
Ce Zhang (ETH Zurich)
