Poster
in
Workshop: ES-FoMo II: 2nd Workshop on Efficient Systems for Foundation Models

Learned Best-Effort LLM Serving

Siddharth Jha · Coleman Hooper · Xiaoxuan Liu · Sehoon Kim · Kurt Keutzer


Abstract:

Many applications must provide low-latency LLM service to users or risk unacceptable user experience. However, over-provisioning resources to serve fluctuating request patterns is often prohibitively expensive. In this work, we present a best-effort serving system that employs deep reinforcement learning to adjust service quality based on the task distribution and system load. Our best-effort system can maintain availability with over 10× higher client request rates, serves above 96% of peak performance 4.1× more often, and serves above 98% of peak performance 2.3× more often than static serving on unpredictable workloads.
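The core idea is to trade service quality for availability as load fluctuates. A minimal sketch of such a controller is below; it is illustrative only, not the authors' system. The quality levels, capacities, and the rule for choosing among them are all hypothetical, and the paper's learned deep-RL policy would replace the hand-written rule here.

```python
# Illustrative sketch of a best-effort serving controller (not the authors'
# code): pick the highest-quality model variant whose capacity covers the
# observed request rate. In the paper, a deep RL policy learns this mapping
# from task distribution and system load; here it is a fixed rule.

from dataclasses import dataclass


@dataclass
class QualityLevel:
    name: str
    relative_quality: float  # fraction of peak output quality (assumed)
    capacity_rps: float      # max sustainable requests/sec (assumed)


# Hypothetical variants, ordered from best quality to highest throughput.
LEVELS = [
    QualityLevel("full", 1.00, 10.0),
    QualityLevel("distilled", 0.98, 40.0),
    QualityLevel("small", 0.96, 100.0),
]


def choose_level(observed_rps: float) -> QualityLevel:
    """Return the best-quality level that can sustain the current load."""
    for level in LEVELS:
        if observed_rps <= level.capacity_rps:
            return level
    # Saturated: keep serving best-effort at the lowest-quality level
    # rather than refusing requests.
    return LEVELS[-1]
```

Under this rule, light load is served at full quality, while a load spike degrades quality gracefully instead of dropping availability, which is the behavior the abstract's 10× availability figure refers to.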