

Poster in Workshop: Workshop on Theoretical Foundations of Foundation Models (TF2M)

Attention Is All You Need But You Don’t Need All Of It For Inference of Large Language Models

Georgy Tyukin · Gbetondji Dovonon · Jean Kaddour · Pasquale Minervini


Abstract:

The inference demand for LLMs has skyrocketed in recent months, and serving models with low latency remains challenging due to their size and the quadratic complexity of attention in the input length. In this work, we investigate the effect of dropping various layers at inference time on the performance of Llama 2 models. We find that dropping deeper attention layers, which we call inference-time attention removal (ITAR), only marginally decreases performance. For example, removing 33% of the attention layers in a 13B Llama 2 model results in a 0.9% drop in average performance over the OpenLLM benchmark (ARC, HellaSwag, TruthfulQA). Removing attention sublayers leads to a smaller drop in performance and larger runtime improvements than removing feed-forward sublayers.
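To illustrate the idea (not the authors' implementation), the sketch below shows a toy pre-norm decoder block whose attention sublayer can be bypassed at inference time, leaving only the residual path; the feed-forward sublayer is kept. All class names, dimensions, and the choice of which blocks to skip are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of inference-time attention removal in a toy decoder stack.
# Skipping the attention sublayer reduces the block to its residual identity
# plus the feed-forward sublayer.
import torch
import torch.nn as nn


class DecoderBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int, skip_attention: bool = False):
        super().__init__()
        self.skip_attention = skip_attention  # set True for deeper blocks to "remove" attention
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn_norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor, attn_mask: torch.Tensor = None) -> torch.Tensor:
        if not self.skip_attention:
            # Standard pre-norm attention sublayer with a residual connection.
            h = self.attn_norm(x)
            attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
            x = x + attn_out
        # When skip_attention is True, x passes through unchanged (identity),
        # which is the effect of dropping the attention sublayer at inference.
        x = x + self.ffn(self.ffn_norm(x))
        return x


# Example: skip attention in the deepest third of a 12-block toy model
# (4 of 12 blocks, roughly matching the 33% removal discussed above).
n_blocks, d_model = 12, 256
blocks = nn.ModuleList(
    DecoderBlock(d_model, n_heads=8, d_ff=4 * d_model, skip_attention=(i >= 2 * n_blocks // 3))
    for i in range(n_blocks)
)
x = torch.randn(1, 16, d_model)
for block in blocks:
    x = block(x)
print(x.shape)  # torch.Size([1, 16, 256])
```

Because the skipped blocks perform no attention computation at all, the wall-clock savings come directly from avoiding the quadratic attention cost in those layers, while the remaining sublayers are left untouched.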
