Poster in Workshop: ES-FoMo II: 2nd Workshop on Efficient Systems for Foundation Models
Learned Best-Effort LLM Serving
Siddharth Jha · Coleman Hooper · Xiaoxuan Liu · Sehoon Kim · Kurt Keutzer
Abstract:
Many applications must provide low-latency LLM service to users or risk an unacceptable user experience. However, over-provisioning resources to serve fluctuating request patterns is often prohibitively expensive. In this work, we present a best-effort serving system that employs deep reinforcement learning to adjust service quality based on the task distribution and system load. Our best-effort system maintains availability at over 10× higher client request rates, serves above 96% of peak performance 4.1× more often, and serves above 98% of peak performance 2.3× more often than static serving on unpredictable workloads.
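The abstract does not specify the state, action, or reward formulation used by the learned policy, so the following is only a minimal sketch of the general idea: a reinforcement-learning agent that observes the current system load and task mix and picks a service-quality level (e.g., a smaller or larger model variant), trading quality for latency under heavy load. All names, bucket sizes, and reward weights below are illustrative assumptions, not the paper's design.

```python
# Hypothetical sketch of best-effort serving via reinforcement learning.
# The state space (discretized request rate + task mix), action space
# (quality levels), and reward shaping are assumptions for illustration.
import random
from collections import defaultdict

QUALITY_LEVELS = [0, 1, 2]   # e.g., small / medium / large model variant
LOAD_BUCKETS = 5             # discretized client request rate
TASK_BUCKETS = 3             # discretized task-type mix


def make_state(request_rate, max_rate, task_mix_idx):
    """Discretize system load and task distribution into a small state."""
    load = min(int(request_rate / max_rate * LOAD_BUCKETS), LOAD_BUCKETS - 1)
    return (load, task_mix_idx)


class BestEffortPolicy:
    """Tabular Q-learning over (load, task-mix) states and quality actions."""

    def __init__(self, eps=0.1, alpha=0.3, gamma=0.9):
        self.q = defaultdict(float)
        self.eps, self.alpha, self.gamma = eps, alpha, gamma

    def act(self, state):
        # Epsilon-greedy choice of service-quality level.
        if random.random() < self.eps:
            return random.choice(QUALITY_LEVELS)
        return max(QUALITY_LEVELS, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # Standard one-step Q-learning update.
        best_next = max(self.q[(next_state, a)] for a in QUALITY_LEVELS)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])


def reward(quality, latency, slo):
    """Reward higher service quality, heavily penalize latency-SLO violations."""
    return quality - (10.0 if latency > slo else 0.0)
```

In use, the serving frontend would periodically build a state from the observed request rate and task mix, ask the policy for a quality level, route incoming requests accordingly, and feed back a reward based on the resulting quality and latency; the actual system in the paper uses deep RL rather than the tabular learner sketched here.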