Poster in Workshop: ES-FoMo: Efficient Systems for Foundation Models

Generating Efficient Kernels for Quantized Inference on Large Language Models

Tommaso Pegolotti ⋅ Elias Frantar ⋅ Dan Alistarh ⋅ Markus Püschel

Abstract

We present ongoing work on a new automatic code generation approach for supporting quantized generative inference on LLMs such as LLaMA or OPT on off-the-shelf CPUs. Our approach is informed by the target architecture and a performance model, including both hardware characteristics and method-specific accuracy constraints. Results on CPU-based inference for LLaMA models show that our approach can lead to high performance and high accuracy, comparing favorably to the best existing open-source solution.
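As a rough illustration of what quantized inference computes at its core, here is a minimal pure-Python sketch of a dot product over a group of weights stored as packed 4-bit codes. It is not the generated kernel described in the abstract: the 4-bit width, the per-group scale and zero point, and the function names `quantize_group`, `pack_nibbles`, and `quantized_dot` are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def quantize_group(weights: List[float], bits: int = 4) -> Tuple[List[int], float, float]:
    """Uniform asymmetric quantization of one weight group (illustrative only).

    Returns integer codes plus a scale and zero point such that
    weight ~= scale * code + zero.
    """
    levels = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    # Fall back to a scale of 1.0 when every weight in the group is identical.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def pack_nibbles(codes: List[int]) -> bytes:
    """Pack two 4-bit codes per byte, low nibble first (assumes an even count)."""
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def quantized_dot(packed: bytes, scale: float, zero: float, x: List[float]) -> float:
    """Dot product of a packed 4-bit weight group with x, dequantizing on the fly.

    Uses sum((scale * c + zero) * x) = scale * sum(c * x) + zero * sum(x),
    so scale and zero point are applied once per group, not once per weight.
    """
    acc = 0.0
    x_sum = 0.0
    for i, byte in enumerate(packed):
        x0, x1 = x[2 * i], x[2 * i + 1]
        acc += (byte & 0x0F) * x0 + (byte >> 4) * x1
        x_sum += x0 + x1
    return scale * acc + zero * x_sum


if __name__ == "__main__":
    w = [0.12, -0.53, 0.98, -0.07, 0.41, -0.88, 0.30, 0.65]
    x = [1.0, 2.0, -1.0, 0.5, 0.25, -2.0, 1.5, 1.0]
    codes, scale, zero = quantize_group(w)
    packed = pack_nibbles(codes)
    exact = sum(a * b for a, b in zip(w, x))
    print("exact:", exact, "quantized:", quantized_dot(packed, scale, zero, x))
```

Factoring the scale and zero point out of the inner loop is one common design choice for such kernels; an optimized CPU implementation would typically also vectorize the nibble unpacking, which this sketch does not attempt.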
