Poster in Workshop: ICML 2024 Workshop on Foundation Models in the Wild
LoRD: Low-Rank Decomposition of Monolingual Code LLMs for One-Shot Compression
Ayush Kaushal · Tejas Vaidhya · Irina Rish
Keywords: [ Low-Rank Decomposition ] [ Code LLMs ] [ Compression ] [ Optimization ]
We propose using low-rank matrix decomposition (LoRD), which factors a large matrix into a product of two smaller matrices, to compress neural network models and thereby speed up inference. Unlike quantization, LoRD keeps the parameters fully differentiable and trainable and relies on efficient floating-point operations. We investigate its advantages for compressing Large Language Models (LLMs) for monolingual code generation, showing that the ranks of linear layers can be reduced by up to 39.58% with less than a 1% increase in perplexity. Specifically, we use LoRD to compress the StarCoder 16B model to 13.2B parameters with no drop, and to 12.3B parameters with a minimal drop, in the HumanEval Pass@1 score, all within 10 minutes on a single A100 GPU. The compressed models achieve up to a 22.35% inference speedup with just a single line of code change in Hugging Face's implementation with a PyTorch backend.
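To illustrate the core operation, the sketch below decomposes a single PyTorch `nn.Linear` layer via truncated SVD into two smaller linear layers whose product approximates the original weight matrix. This is a minimal illustration under assumed settings, not the authors' released code: the layer size (4096), target rank (1024), and the `low_rank_decompose` helper name are hypothetical choices for demonstration only.

```python
# Minimal sketch of low-rank decomposition of one linear layer (illustrative,
# not the authors' implementation). A weight matrix W is approximated by a
# truncated SVD, W ~= U_r S_r V_r^T, and replaced by two smaller Linear layers.
import torch
import torch.nn as nn


def low_rank_decompose(linear: nn.Linear, rank: int) -> nn.Sequential:
    """Replace an nn.Linear with two smaller nn.Linear layers of the given rank."""
    W = linear.weight.data                            # shape: (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r, S_r, Vh_r = U[:, :rank], S[:rank], Vh[:rank, :]

    down = nn.Linear(linear.in_features, rank, bias=False)
    up = nn.Linear(rank, linear.out_features, bias=linear.bias is not None)
    down.weight.data = S_r.unsqueeze(1) * Vh_r        # (rank, in_features)
    up.weight.data = U_r                              # (out_features, rank)
    if linear.bias is not None:
        up.bias.data = linear.bias.data
    # The result stays fully differentiable and uses ordinary float matmuls.
    return nn.Sequential(down, up)


# Hypothetical usage: compress a 4096x4096 projection to rank 1024.
layer = nn.Linear(4096, 4096)
compressed = low_rank_decompose(layer, rank=1024)
x = torch.randn(2, 4096)
print(torch.dist(layer(x), compressed(x)))            # approximation error
```

Note that a rank-r factorization of an m-by-n weight matrix reduces parameter count only when r < mn/(m + n); applying such reductions across a model's linear layers is what translates into the reported model-size and inference-speed gains.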