Oral
FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
Ying Sheng · Lianmin Zheng · Binhang Yuan · Zhuohan Li · Max Ryabinin · Beidi Chen · Percy Liang · Christopher Ré · Ion Stoica · Ce Zhang

Wed Jul 26 07:40 PM -- 07:48 PM (PDT) @ Meeting Room 313

The high computational and memory requirements of large language model (LLM) inference make it feasible only with multiple high-end accelerators. Motivated by the emerging demand for latency-insensitive tasks with batched processing, this paper initiates the study of high-throughput LLM inference using limited resources, such as a single commodity GPU. We present FlexGen, a high-throughput generation engine for running LLMs with limited GPU memory. FlexGen can be flexibly configured under various hardware resource constraints by aggregating memory and computation from the GPU, CPU, and disk. By solving a linear programming problem, it searches for efficient patterns to store and access tensors. FlexGen further compresses the weights and the attention cache to 4 bits with negligible accuracy loss. These techniques enable FlexGen to have a larger space of batch size choices and thus significantly increase maximum throughput. As a result, when running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems, reaching a generation throughput of 1 token/s for the first time with an effective batch size of 144. On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours. The code is available at https://github.com/FMInference/FlexGen.
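The policy search mentioned in the abstract, reduced to its essence, chooses what fraction of each tensor class lives on GPU, CPU, and disk. A minimal sketch of such a placement LP follows, with illustrative numbers (weight size, bandwidths, memory budgets) that are assumptions rather than figures from the paper; FlexGen's actual cost model spans weights, activations, and the KV cache across the full compute schedule.

import numpy as np
from scipy.optimize import linprog

# Illustrative numbers only (assumed, not from the paper).
W = 350e9                      # OPT-175B fp16 weights, approx. bytes
bw_cpu, bw_disk = 12e9, 2e9    # CPU->GPU and disk->GPU bandwidth, bytes/s
gpu_mem, cpu_mem, disk_mem = 16e9, 200e9, 1.5e12  # memory budgets, bytes

# Variables x = [frac_gpu, frac_cpu, frac_disk]: share of weights per device.
# Objective: time per generation step to stream the non-resident weights.
c = [0.0, W / bw_cpu, W / bw_disk]

A_eq, b_eq = [[1.0, 1.0, 1.0]], [1.0]      # fractions sum to 1
A_ub = np.diag([W, W, W])                  # bytes placed on each device...
b_ub = [gpu_mem, cpu_mem, disk_mem]        # ...must fit its budget

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0.0, 1.0)] * 3)
print("placement (gpu, cpu, disk):", res.x)

The 4-bit compression of weights and attention cache refers to group-wise quantization. The round-trip sketch below assumes a group size of 64 and an asymmetric min/max scheme; it illustrates the idea rather than FlexGen's exact kernels or packing format.

import numpy as np

def quantize_4bit(x, group_size=64):
    # Group-wise asymmetric min/max quantization to 16 levels (4 bits).
    groups = x.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = np.maximum((hi - lo) / 15.0, 1e-8)
    q = np.round((groups - lo) / scale).astype(np.uint8)
    return q, lo, scale

def dequantize_4bit(q, lo, scale, shape):
    # Reconstruct an approximate fp32 tensor from the quantized groups.
    return (q.astype(np.float32) * scale + lo).reshape(shape)

# Round-trip a fake fp32 weight matrix and measure the error.
w = np.random.randn(1024, 1024).astype(np.float32)
q, lo, scale = quantize_4bit(w)
w_hat = dequantize_4bit(q, lo, scale, w.shape)
print("mean abs error:", np.abs(w - w_hat).mean())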

Author Information

Ying Sheng (Stanford University)

Ying Sheng is a PhD student at Stanford University advised by Clark Barrett. Her recent research topics include large language models and program verification, and she has been studying model serving and inference from several angles; among that work, she created FlexGen, an initial effort toward high-throughput inference on limited resources. Her recent focus is helping build the LMSYS org, which aims to make large models accessible to everyone. She is also one of the developers of cvc5, one of the mainstream SMT solvers; her work in SMT has won the best paper and best tool paper awards at IJCAR and TACAS.

Lianmin Zheng (UC Berkeley)
Binhang Yuan (Swiss Federal Institute of Technology)
Zhuohan Li (UC Berkeley)
Max Ryabinin (Yandex/HSE University)
Beidi Chen (CMU / FAIR)
Percy Liang (Stanford University)
Christopher Ré (Stanford University)
Ion Stoica (University of California, Berkeley)
Ce Zhang (ETH Zurich)
