Poster in Workshop: ES-FoMo II: 2nd Workshop on Efficient Systems for Foundation Models
Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding
Benjamin Bergner · Andrii Skliar · Amelie Royer · Tijmen Blankevoort · Yuki Asano · Babak Ehteshami Bejnordi
Abstract:
Large language models (LLMs) are widely used for text generation, but their size and reliance on autoregressive decoding increase deployment costs and latency. We propose a hybrid approach that combines language models of different sizes to improve decoding efficiency while maintaining high performance. Our method uses a pretrained LLM to encode all prompt tokens once in parallel; the resulting representations then condition and guide a small language model (SLM), which generates the response more efficiently. Combining encoder-decoder LLMs with both encoder-decoder and decoder-only SLMs, we achieve speedups of up to 4x with minor performance penalties of 1-2% on translation and summarization tasks compared to the LLM.
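To make the division of labor concrete, below is a minimal PyTorch sketch of the decoding scheme the abstract describes: a frozen LLM encodes the prompt once in parallel, a learned projection maps its representations into the SLM's hidden width, and the small decoder runs the cheap autoregressive loop. The module names, the `(tokens, cond)` decoder signature, and the single linear projection are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of LLM-to-SLM decoding; component interfaces are assumed.
import torch
import torch.nn as nn


class LLMToSLM(nn.Module):
    def __init__(self, llm_encoder: nn.Module, slm_decoder: nn.Module,
                 llm_dim: int, slm_dim: int):
        super().__init__()
        self.llm_encoder = llm_encoder  # large, pretrained, kept frozen
        for p in self.llm_encoder.parameters():
            p.requires_grad = False
        # Assumed: a linear projection bridges LLM and SLM hidden sizes.
        self.proj = nn.Linear(llm_dim, slm_dim)
        self.slm_decoder = slm_decoder  # small autoregressive model

    @torch.no_grad()
    def encode_prompt(self, prompt_ids: torch.Tensor) -> torch.Tensor:
        # One parallel forward pass over all prompt tokens with the LLM.
        return self.llm_encoder(prompt_ids)  # (B, T, llm_dim)

    def generate(self, prompt_ids: torch.Tensor, bos_id: int, eos_id: int,
                 max_new_tokens: int = 128) -> torch.Tensor:
        # Encode once with the LLM, then reuse the projected representations
        # at every decoding step of the SLM.
        cond = self.proj(self.encode_prompt(prompt_ids))  # (B, T, slm_dim)
        out = torch.full((prompt_ids.size(0), 1), bos_id,
                         dtype=torch.long, device=prompt_ids.device)
        for _ in range(max_new_tokens):
            # Each cheap SLM step is conditioned on the fixed LLM encoding.
            logits = self.slm_decoder(out, cond)  # (B, t, vocab)
            next_id = logits[:, -1].argmax(-1, keepdim=True)  # greedy decode
            out = torch.cat([out, next_id], dim=1)
            if (next_id == eos_id).all():
                break
        return out
```

The intended efficiency gain is that the expensive LLM runs exactly once per prompt (a parallelizable encoding pass), while every sequential decoding step is executed by the much smaller SLM.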