Representation Drift Compensation: A Zero-Cost Enhancement for LLM Decomposition
Abstract
While low-rank decomposition offers potential for LLM size reduction, its application is limited by considerable performance degradation. In this work, we identify and formalize a key, previously overlooked issue in LLM decomposition: \textit{representation drift}. We show that approximation errors introduced by decomposition propagate and amplify non-linearly through the deep layers of the transformer architecture, progressively distorting internal representations and degrading downstream performance. To mitigate this, we introduce a conceptually simple yet principled compensation mechanism, named ``\our'', that suppresses error at its source. By learning to align the output distribution of each decomposed transformer block with that of its original counterpart, our method effectively counteracts representation drift, achieving notable performance recovery with zero inference overhead. Extensive experiments on OPT, LLaMA-2, LLaMA-3, and Qwen demonstrate substantial improvements in language modeling, common-sense reasoning, knowledge-based reasoning, and vision-language tasks. For instance, on LLaMA-3-8B and OPT-13B at 40% compression, perplexity is reduced by more than 70% while accuracy on reasoning tasks improves by over 10%. Our code is available at this anonymous URL.
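To make the core idea concrete, the following is a minimal PyTorch sketch of block-level drift compensation, assuming a truncated-SVD factorization of dense weights and a simple MSE alignment objective on calibration activations. The helpers `low_rank_decompose` and `block_alignment_loss` are hypothetical illustrations, not the paper's released implementation; the actual method may use a different factorization or distributional alignment loss.

\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

def low_rank_decompose(linear: nn.Linear, rank: int) -> nn.Sequential:
    """Replace a dense linear layer with two low-rank factors via truncated SVD.

    Approximates W (out x in) as U_r @ V_r, where U_r absorbs the top-r
    singular values, so y = W x ~ U_r (V_r x).
    """
    W = linear.weight.data                  # shape: (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]            # (out, rank), singular values folded in
    V_r = Vh[:rank, :]                      # (rank, in)
    down = nn.Linear(linear.in_features, rank, bias=False)
    up = nn.Linear(rank, linear.out_features, bias=linear.bias is not None)
    down.weight.data.copy_(V_r)
    up.weight.data.copy_(U_r)
    if linear.bias is not None:
        up.bias.data.copy_(linear.bias.data)
    return nn.Sequential(down, up)

def block_alignment_loss(original_block: nn.Module,
                         decomposed_block: nn.Module,
                         hidden_states: torch.Tensor) -> torch.Tensor:
    """Penalize divergence between original and decomposed block outputs.

    Assumes each block maps a hidden-state tensor to a tensor of the same
    shape; the frozen original block provides the alignment target.
    """
    with torch.no_grad():
        target = original_block(hidden_states)
    pred = decomposed_block(hidden_states)
    return F.mse_loss(pred, target)
\end{verbatim}

Because the alignment loss is applied only during a post-decomposition calibration phase, the deployed model consists solely of the (smaller) factorized layers, which is consistent with the zero-inference-overhead claim above.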