

Poster in Workshop: Workshop on Theoretical Foundations of Foundation Models (TF2M)

Attention Is All You Need But You Don’t Need All Of It For Inference of Large Language Models

Georgy Tyukin · Gbetondji Dovonon · Jean Kaddour · Pasquale Minervini


Abstract:

The inference demand for LLMs has skyrocketed in recent months, and serving models with low latency remains challenging due to their size and the quadratic complexity of attention in the input length. In this work, we investigate the effect of dropping various layers at inference time on the performance of Llama 2 models. We find that dropping deeper attention layers, which we call inference-time attention removal (ITAR), only marginally decreases performance. For example, removing 33% of the attention layers in a 13B Llama 2 model results in a 0.9% drop in average performance over the OpenLLM benchmark (ARC, HellaSwag, TruthfulQA). Removing attention sublayers leads to a smaller drop in performance and larger runtime improvements than removing feed-forward sublayers.
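To illustrate the idea (not the authors' implementation), the sketch below shows a toy pre-norm decoder block whose attention sublayer can be bypassed at inference time, leaving only the residual path; the feed-forward sublayer is kept. All class names, dimensions, and the choice of which blocks to skip are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of inference-time attention removal in a toy decoder stack.
# Skipping the attention sublayer reduces the block to its residual identity
# plus the feed-forward sublayer.
import torch
import torch.nn as nn


class DecoderBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int, skip_attention: bool = False):
        super().__init__()
        self.skip_attention = skip_attention  # set True for deeper blocks to "remove" attention
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn_norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor, attn_mask: torch.Tensor = None) -> torch.Tensor:
        if not self.skip_attention:
            # Standard pre-norm attention sublayer with a residual connection.
            h = self.attn_norm(x)
            attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
            x = x + attn_out
        # When skip_attention is True, x passes through unchanged (identity),
        # which is the effect of dropping the attention sublayer at inference.
        x = x + self.ffn(self.ffn_norm(x))
        return x


# Example: skip attention in the deepest third of a 12-block toy model
# (4 of 12 blocks, roughly matching the 33% removal discussed above).
n_blocks, d_model = 12, 256
blocks = nn.ModuleList(
    DecoderBlock(d_model, n_heads=8, d_ff=4 * d_model, skip_attention=(i >= 2 * n_blocks // 3))
    for i in range(n_blocks)
)
x = torch.randn(1, 16, d_model)
for block in blocks:
    x = block(x)
print(x.shape)  # torch.Size([1, 16, 256])
```

Because the skipped blocks perform no attention computation at all, the wall-clock savings come directly from avoiding the quadratic attention cost in those layers, while the remaining sublayers are left untouched.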
