Workshop: ES-FoMo: Efficient Systems for Foundation Models

A Simple and Effective Pruning Approach for Large Language Models

Mingjie Sun · Zhuang Liu · Anna Bair · Zico Kolter


As their size increases, Large Language Models (LLMs) are natural candidates for network pruning. Existing methods require either retraining or solving a weight reconstruction problem, which may be computationally expensive for billion-scale LLMs. In this paper, we introduce a novel, simple yet effective pruning method, termed Wanda (Pruning by Weights and activations), to induce sparsity in pretrained LLMs. Motivated by the recent observation of emergent large magnitude features in LLMs, our approach prunes the weights with the smallest magnitudes multiplied by the corresponding input activations, on a per-output basis. Notably, Wanda requires no retraining or weight update, and the pruned LLM can be used as is. We conduct a thorough evaluation of our method on LLaMA, one of the best performing LLMs available. Wanda significantly outperforms the established baseline of magnitude pruning and competes favorably against recent methods involving intensive weight updates.
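
The pruning rule described in the abstract can be illustrated with a minimal NumPy sketch for a single linear layer. This is not the authors' implementation. The function name `prune_by_weight_and_activation` and its parameters are hypothetical, and the sketch assumes that each input feature's activation is summarized by its L2 norm over a small set of calibration tokens, a detail the abstract does not specify.

```python
import numpy as np

def prune_by_weight_and_activation(weight, calib_inputs, sparsity=0.5):
    """Zero out the lowest-scoring weights in each output row.

    weight:       array of shape (out_features, in_features)
    calib_inputs: array of shape (n_tokens, in_features)
    sparsity:     fraction of weights to remove per output row
    """
    # Assumption: one activation scale per input feature, taken as the
    # L2 norm of that feature over the calibration tokens.
    act_norm = np.linalg.norm(calib_inputs, axis=0)      # (in_features,)

    # Score = |weight| * input activation scale (broadcast across rows).
    score = np.abs(weight) * act_norm                    # (out, in)

    # Per-output basis: rank the scores within each row independently
    # and mark the n_prune smallest entries of every row for removal.
    n_prune = int(weight.shape[1] * sparsity)
    prune_idx = np.argsort(score, axis=1)[:, :n_prune]

    mask = np.ones_like(weight, dtype=bool)
    np.put_along_axis(mask, prune_idx, False, axis=1)

    # No weight update: surviving weights keep their original values.
    return weight * mask, mask
```

Calling the function with a weight matrix and a batch of calibration activations returns the pruned weights and a boolean keep-mask. Because the score multiplies each weight by its input's activation scale, a small weight attached to a large-magnitude input feature can survive while a larger weight attached to a weak feature is removed, which is where this rule departs from plain magnitude pruning.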
