Poster in Workshop: ES-FoMo II: 2nd Workshop on Efficient Systems for Foundation Models

Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding

Benjamin Bergner · Andrii Skliar · Amelie Royer · Tijmen Blankevoort · Yuki Asano · Babak Ehteshami Bejnordi


Abstract:

Large language models (LLMs) are widely used for text generation, but their size and reliance on autoregressive decoding increase deployment costs and latency. We propose a hybrid approach that combines language models of different sizes to improve decoding efficiency while maintaining performance. Our method uses a pretrained LLM to encode the prompt tokens in parallel; this encoding then conditions a small language model (SLM) that generates the response. Combining encoder-decoder LLMs with both encoder-decoder and decoder-only SLMs, we achieve speedups of up to 4x with minor performance penalties of 1-2% on translation and summarization tasks relative to the LLM alone.
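To make the mechanism concrete, here is a minimal, self-contained PyTorch sketch of the general idea described in the abstract, not the authors' implementation: `ToyEncoder`, `ToyDecoder`, the linear projection, and the prepend-style conditioning are all illustrative assumptions, and the paper's actual fusion of LLM representations into the SLM may differ.

```python
import torch
import torch.nn as nn

D_LLM, D_SLM, VOCAB = 64, 32, 100  # toy sizes for illustration

class ToyEncoder(nn.Module):
    """Stand-in for a pretrained LLM used as a prompt encoder (hypothetical)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_LLM)
    def forward(self, ids):
        return self.embed(ids)  # (B, T_prompt, D_LLM)

class ToyDecoder(nn.Module):
    """Stand-in for a small autoregressive decoder, the SLM (hypothetical)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_SLM)
        self.layer = nn.TransformerEncoderLayer(D_SLM, nhead=4, batch_first=True)
        self.head = nn.Linear(D_SLM, VOCAB)
    def forward_from_embeddings(self, h):
        return self.head(self.layer(h))  # (B, T, VOCAB)

class LLMtoSLM(nn.Module):
    def __init__(self, llm, slm):
        super().__init__()
        self.llm, self.slm = llm, slm
        # Projection aligning LLM hidden states with the SLM embedding space.
        self.proj = nn.Linear(D_LLM, D_SLM)

    @torch.no_grad()
    def generate(self, prompt_ids, max_new_tokens=8, bos_id=1, eos_id=2):
        # 1) One parallel pass of the expensive LLM over the full prompt.
        cond = self.proj(self.llm(prompt_ids))        # (1, T_prompt, D_SLM)
        out = [bos_id]
        # 2) Cheap autoregressive loop: only the SLM runs per generated token,
        #    conditioned on the projected LLM prompt representations.
        for _ in range(max_new_tokens):
            tok = self.slm.embed(torch.tensor([out])) # (1, t, D_SLM)
            h = torch.cat([cond, tok], dim=1)
            logits = self.slm.forward_from_embeddings(h)
            nxt = int(logits[0, -1].argmax())         # greedy decoding
            out.append(nxt)
            if nxt == eos_id:
                break
        return out

model = LLMtoSLM(ToyEncoder(), ToyDecoder())
print(model.generate(torch.randint(0, VOCAB, (1, 10))))
```

The structural point is that the large model performs a single parallel pass over the prompt, while every step of the sequential decoding loop touches only the small model, which is where the claimed speedup comes from.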
