
Poster in Workshop: ES-FoMo II: 2nd Workshop on Efficient Systems for Foundation Models

The Mamba in the Llama: Distilling and Accelerating Hybrid Models

Junxiong Wang · Daniele Paliotta · Avner May · Alexander Rush · Tri Dao


Abstract:

Recent research suggests that state-space models (SSMs) such as Mamba can be competitive with Transformer models for language modeling while offering advantageous deployment characteristics. Given the field's focus on and expertise in training large-scale Transformer models, we consider the challenge of converting these pretrained models into SSMs for deployment. We demonstrate that, using academic GPU resources, it is feasible to distill large Transformers into SSMs by reusing the linear projection weights from their attention layers. The resulting hybrid model, which retains a quarter of the attention layers, achieves performance comparable to the original Transformer. Moreover, we introduce a hardware-aware speculative decoding algorithm that accelerates the inference speed of state-space models. Overall, we show how, with limited computational resources, a large Transformer can be distilled into a hybrid SSM and decoded efficiently.
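
The distillation idea hinges on weight reuse: the linear projections of a pretrained attention layer have shapes compatible with the input and state projections of an SSM-style layer, so they can initialize the student before distillation. The sketch below illustrates this in PyTorch; the SSMLikeLayer class and the specific V→x, K→B, Q→C, O→out mapping are illustrative assumptions for this example, not the authors' exact recipe.

    # Hedged sketch: initialize an SSM-style layer from a pretrained attention
    # layer's projection weights. Layer names and the mapping are assumptions.
    import torch
    import torch.nn as nn

    d_model = 512

    class AttentionProjections(nn.Module):
        # Stand-in for the linear projections of a pretrained attention block.
        def __init__(self, d: int):
            super().__init__()
            self.q_proj = nn.Linear(d, d, bias=False)
            self.k_proj = nn.Linear(d, d, bias=False)
            self.v_proj = nn.Linear(d, d, bias=False)
            self.o_proj = nn.Linear(d, d, bias=False)

    class SSMLikeLayer(nn.Module):
        # Hypothetical SSM-style layer whose projections mirror attention's shapes.
        def __init__(self, d: int):
            super().__init__()
            self.x_proj = nn.Linear(d, d, bias=False)    # input path ("value"-like)
            self.B_proj = nn.Linear(d, d, bias=False)    # state input matrix ("key"-like)
            self.C_proj = nn.Linear(d, d, bias=False)    # state readout ("query"-like)
            self.out_proj = nn.Linear(d, d, bias=False)  # output projection

    def init_from_attention(ssm: SSMLikeLayer, attn: AttentionProjections) -> None:
        # Reuse pretrained attention weights so the student starts close to the teacher.
        with torch.no_grad():
            ssm.x_proj.weight.copy_(attn.v_proj.weight)
            ssm.B_proj.weight.copy_(attn.k_proj.weight)
            ssm.C_proj.weight.copy_(attn.q_proj.weight)
            ssm.out_proj.weight.copy_(attn.o_proj.weight)

    attn = AttentionProjections(d_model)
    ssm = SSMLikeLayer(d_model)
    init_from_attention(ssm, attn)

In a hybrid model, such initialized SSM layers would replace most attention layers, with a fraction of the original attention layers kept in place, and the whole model would then be distilled against the original Transformer.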
