
Workshop: ES-FoMo: Efficient Systems for Foundation Models

GPT-Zip: Deep Compression of Finetuned Large Language Models

Berivan Isik · Hermann Kumbong · Wanyi Ning · Xiaozhe Yao · Sanmi Koyejo · Ce Zhang

Abstract: Storage is increasingly a practical bottleneck to scaling large language model (LLM) systems for personalization, co-location, and other use cases that require storing the pretrained base model plus multiple finetuned models. To this end, we propose GPT-Zip for post-finetuning compression. GPT-Zip uses quantization and sparsification to efficiently compress finetuned models by exploiting their closeness to the pretrained base model. Specifically, we demonstrate that the \emph{difference} between the finetuned models and the pretrained base model can be efficiently quantized to $2$ bits and simultaneously pruned to $95\%$ sparsity -- providing up to $52$ times overall size reduction. Thus, GPT-Zip avoids the linear growth in storage cost incurred by naively storing each finetuned model in full. We show that this compression can be achieved without performance degradation, as measured by evaluations on several tasks from the Natural Instructions dataset. Surprisingly, GPT-Zip sometimes improves accuracy over uncompressed models. We demonstrate the efficacy of GPT-Zip on four finetuned OPT-1.3B models and show that GPT-Zip reduces the storage cost by $16$ times more than existing LLM compression techniques while attaining significantly better performance.
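The core idea -- storing only a sparsified, low-bit quantized weight difference instead of the full finetuned model -- can be sketched as follows. This is a minimal illustration under assumed details (magnitude-based pruning and uniform 2-bit quantization; all function names are hypothetical), not the authors' actual implementation:

```python
import numpy as np

def compress_delta(base, finetuned, sparsity=0.95, bits=2):
    """Compress the finetuned-minus-base weight difference by
    keeping only the largest-magnitude entries, then uniformly
    quantizing the survivors to 2**bits levels.
    (Illustrative sketch, not the GPT-Zip implementation.)"""
    delta = finetuned - base
    # Sparsification: keep the top (1 - sparsity) fraction by magnitude.
    k = max(1, int(round(delta.size * (1.0 - sparsity))))
    threshold = np.partition(np.abs(delta).ravel(), -k)[-k]
    mask = np.abs(delta) >= threshold
    kept = delta[mask]
    # Uniform quantization of the surviving values into 2**bits levels.
    lo, hi = float(kept.min()), float(kept.max())
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((kept - lo) / scale).astype(np.uint8)
    return mask, codes, lo, scale

def decompress_delta(base, mask, codes, lo, scale):
    """Reconstruct an approximate finetuned model from the base
    model plus the dequantized sparse delta."""
    delta = np.zeros_like(base)
    delta[mask] = codes.astype(base.dtype) * scale + lo
    return base + delta

# Usage: only `mask` (1 bit/weight) and `codes` (2 bits/kept weight)
# need to be stored per finetuned model, on top of the shared base.
rng = np.random.default_rng(0)
base = rng.standard_normal((256, 256)).astype(np.float32)
finetuned = base + 0.01 * rng.standard_normal((256, 256)).astype(np.float32)
mask, codes, lo, scale = compress_delta(base, finetuned)
approx = decompress_delta(base, mask, codes, lo, scale)
```

In this sketch the per-model storage is dominated by the boolean mask and the 2-bit codes, which is how the combination of 95% sparsity and 2-bit quantization yields the large overall size reduction reported in the abstract.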
