

Poster

Pruning Small Pre-Trained Weights $\textit{Irreversibly}$ and $\textit{Monotonically}$ Impairs "Difficult" Downstream Tasks in LLMs

Lu Yin · Ajay Jaiswal · Shiwei Liu · Souvik Kundu · Zhangyang “Atlas” Wang


Abstract: We present the $\textit{Junk DNA Hypothesis}$ by adopting a novel $\textit{task-centric}$ angle on the pre-trained weights of large language models (LLMs). It is widely believed that the weights of LLMs contain significant redundancy, and that a considerable fraction of the parameters can therefore be removed by $\textit{pruning}$ without compromising performance. Contrary to this belief, this paper presents a $\textit{counter-argument}$: small-magnitude weights of the pre-trained model encode vital knowledge essential for tackling difficult downstream tasks, manifested as a $\textbf{monotonic relationship}$ between downstream task difficulty and the performance drop incurred as we prune more pre-trained weights by magnitude. Moreover, we reveal that removing these seemingly inconsequential weights can result in an $\textbf{irreparable loss}$ of knowledge and in performance degradation on difficult tasks, even when downstream continual training is allowed. Interestingly, our evaluations show that another popular compression technique, $\textbf{quantization}$, $\textbf{fails}$ to exhibit a similar "monotonic" effect and does not disentangle this task-difficulty information as convincingly. To study this formally, we introduce several quantifiable metrics to $\textit{gauge downstream task difficulty}$: (a) within the same task category, and (b) across different task categories. Our extensive experiments substantiate the Junk DNA Hypothesis across a diverse range of model sizes, tasks, datasets, and even pruning methods. Code will be released.
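The pruning baseline the abstract refers to is magnitude pruning, i.e., zeroing out the weights with the smallest absolute values. The snippet below is only a minimal illustrative sketch of layer-wise magnitude pruning on a PyTorch model at a chosen sparsity level; the function name, the restriction to `torch.nn.Linear` layers, and the per-layer (rather than global) thresholding are assumptions for illustration, not the paper's exact experimental setup.

```python
import torch

def magnitude_prune(model: torch.nn.Module, sparsity: float) -> torch.nn.Module:
    """Zero out the smallest-magnitude weights in every Linear layer.

    `sparsity` is the fraction of weights removed in each layer
    (e.g. 0.5 keeps the largest 50% of weights by absolute value).
    """
    with torch.no_grad():
        for module in model.modules():
            if isinstance(module, torch.nn.Linear):
                w = module.weight
                k = int(w.numel() * sparsity)
                if k == 0:
                    continue
                # Threshold = k-th smallest absolute value within this layer.
                threshold = w.abs().flatten().kthvalue(k).values
                # Keep only weights above the threshold; the rest become zero.
                mask = (w.abs() > threshold).to(w.dtype)
                w.mul_(mask)
    return model
```

Under the hypothesis, calling a routine like `magnitude_prune(llm, sparsity=0.6)` would hurt "difficult" downstream tasks progressively more as the sparsity level increases, while easier tasks remain comparatively unaffected.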
