CLIMB: Taming the LoRA Residency Cliff in Multi-LoRA Serving
Haoran Zhang ⋅ Zhiyu Liang ⋅ ZUO Decheng ⋅ Hongzhi Wang
Abstract
Multi-tenant multi-LoRA serving multiplexes many LoRA adapters on a single GPU under high utilization, where most device memory is reserved for the KV cache, leaving only a small residency budget $K$ for adapters. In this regime, adapter readiness is atomic: if an adapter is not device-resident, the engine must perform a mandatory fetch, stalling shared execution and amplifying tail latency system-wide. With only $K$ residency slots, we identify a LoRA residency cliff: once the active adapter working set exceeds $K$, time-to-first-token (TTFT) tail latency can exhibit a congestion collapse rather than smooth degradation. To tame this cliff, we propose CLIMB, a minimal ingress controller that enforces feasibility-first admission by queueing non-resident adapters outside the engine, prioritizing critical (VIP) traffic, and rotating background adapters via round-robin. On a cliff-inducing workload, CLIMB averts collapse, reducing VIP TTFT p99 from 38.7 s to 13.1 s at matched throughput (10.66 rps) by keeping VIP engine latency near 0.13 s and shifting the residual tail into explicit ingress queueing. Overall, CLIMB shifts fetch-induced stalls from inside the engine to managed ingress queues, mitigating tail amplification without throughput loss in the evaluated settings.
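The admission policy described above can be illustrated with a small sketch. This is not the authors' implementation: the class `IngressController`, its methods, and the simplification that an adapter holds a residency slot exactly while it has requests inside the engine are all assumptions made for illustration. Actual adapter fetches and evictions are not modeled. The sketch only shows the three ingredients named in the abstract: a feasibility check against the budget $K$, VIP-first ordering, and round-robin rotation of background adapters.

```python
from collections import deque


class IngressController:
    """Toy feasibility-first admission over K adapter residency slots.

    Hypothetical sketch: an adapter is treated as resident exactly while
    it has at least one request in flight inside the engine.
    """

    def __init__(self, k, vip_adapters):
        self.k = k                      # residency budget K
        self.vip = set(vip_adapters)    # adapters treated as critical
        self.in_flight = {}             # adapter -> requests inside the engine
        self.vip_q = deque()            # FIFO of (adapter, request) for VIP traffic
        self.bg_q = {}                  # background adapter -> deque of requests
        self.bg_rr = deque()            # round-robin ring of background adapters

    def submit(self, adapter, req):
        """Queue a request at ingress, outside the engine."""
        if adapter in self.vip:
            self.vip_q.append((adapter, req))
            return
        if adapter not in self.bg_q:
            self.bg_q[adapter] = deque()
            self.bg_rr.append(adapter)
        self.bg_q[adapter].append(req)

    def _feasible(self, adapter):
        # Admitting is feasible if the adapter already holds a slot
        # or a free slot remains, so the engine never needs more than K.
        return adapter in self.in_flight or len(self.in_flight) < self.k

    def _admit(self, adapter, req, out):
        self.in_flight[adapter] = self.in_flight.get(adapter, 0) + 1
        out.append((adapter, req))

    def finish(self, adapter):
        """Report a completed request; frees the slot when none remain."""
        self.in_flight[adapter] -= 1
        if self.in_flight[adapter] == 0:
            del self.in_flight[adapter]

    def step(self):
        """Return the (adapter, request) pairs admitted in this round."""
        admitted = []
        # 1) VIP traffic first, in FIFO order; infeasible requests keep waiting.
        waiting = deque()
        while self.vip_q:
            adapter, req = self.vip_q.popleft()
            if self._feasible(adapter):
                self._admit(adapter, req, admitted)
            else:
                waiting.append((adapter, req))
        self.vip_q = waiting
        # 2) One pass around the background ring, one request per adapter.
        for _ in range(len(self.bg_rr)):
            adapter = self.bg_rr.popleft()
            queue = self.bg_q[adapter]
            if self._feasible(adapter):
                self._admit(adapter, queue.popleft(), admitted)
            if queue:
                self.bg_rr.append(adapter)   # still has work: back of the ring
            else:
                del self.bg_q[adapter]
        return admitted
```

With `k=2` and one VIP adapter, submitting requests for background adapters `a`, `b`, `c` and for the VIP adapter admits the VIP request and `a` in the first round. Requests for `b` and `c` wait at ingress until `finish` frees a slot, so the number of adapters the engine needs never exceeds the budget.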