
Poster
in
Workshop: ES-FoMo II: 2nd Workshop on Efficient Systems for Foundation Models

Efficient LLM Pruning with Global Token-Dependency Awareness and Hardware-Adapted Inference

Oshin Dutta · Ritvik Gupta · Sumeet Agarwal


Abstract:

Structured pruning removes entire components such as attention heads or layers to yield faster dense models. However, previous methods require significant pruning time and overlook the token embedding dimension, missing potential inference acceleration. Moreover, pruning heads in grouped-query attention is rarely attempted because of the interdependencies among heads within a group. To address these limitations, we propose a structured pruning method for LLMs that incorporates the Variational Information Bottleneck (VIB) to obtain a compressed representation at each structural element while preserving the information essential for accurate prediction. We extend the formulation to account for the influence of all preceding layers on the current compressed representation, enabling a globally informed reduction of the token embedding dimension and of grouped-query heads, neither of which was explored in previous work. Additionally, we include a post-pruning step that adjusts the pruned model's dimensions to make optimal use of GPU Tensor Cores, which speeds up inference by up to 60%. Evaluated on several language benchmarks using variants of LLaMA models and Mistral, our method reduces pruning time by up to 90% while delivering higher inference speed and competitive performance.
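The post-pruning dimension adjustment can be illustrated with a minimal sketch. The abstract does not specify the exact alignment rule, so the helper below is a hypothetical one: it assumes pruned sizes are snapped to the nearest multiple of 8, the alignment at which FP16 GEMMs on NVIDIA Tensor Cores reach peak throughput, while never collapsing a dimension to zero.

```python
def align_dim(dim: int, multiple: int = 8) -> int:
    """Snap a pruned dimension to the nearest multiple (hypothetical helper).

    Tensor Cores hit peak GEMM throughput when matrix dimensions are
    multiples of 8 for FP16 (16 for INT8); a pruned model with "ragged"
    head or embedding sizes falls back to slower CUDA-core kernels.
    """
    lower = (dim // multiple) * multiple
    upper = lower + multiple
    # Choose the nearer multiple, rounding up on ties from zero so a
    # structural element is never eliminated entirely by alignment.
    if lower == 0 or (dim - lower) > (upper - dim):
        return upper
    return lower
```

For example, a pruned embedding size of 61 would be padded up to 64, while 60 would be trimmed to 56; in practice the choice between padding and trimming would also depend on the saliency scores of the channels involved.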