

Poster

FlexSM: Flexible Spatial-Temporal Multiplexing for LLM Serving

Jiangfei Duan · Runyu Lu · Haojie Duanmu · Xiuhong Li · Xingcheng ZHANG · Dahua Lin · Ion Stoica · Hao Zhang


Abstract:

Generative large language models (LLMs) have demonstrated remarkable performance across various domains. However, serving LLMs poses substantial challenges due to their considerable computation and memory requirements. Moreover, the varying popularity of and demand for different LLMs lead to significant resource under-utilization in multi-model serving systems that rely on spatial partitioning or temporal multiplexing alone. In this paper, we present FlexSM, a flexible spatial-temporal multiplexing system for efficiently serving multiple LLMs. The key insight is to colocate LLMs according to their popularity in order to multiplex memory resources, and to exploit the characteristics of the prefill and decoding phases by separating them and flexibly colocating them to multiplex computation resources. FlexSM formally formulates the multiplexing problem and proposes a novel placement algorithm and an adaptive batch scheduling strategy to identify optimal colocations and maximize utilization. FlexSM also designs a unified resource manager to enable flexible and efficient multiplexing. Evaluation results show that FlexSM achieves up to 1.8× higher throughput or processes 2.9× more requests within 99% SLO attainment.
