LiftQuant: Continuous Bit-Width Control for Pareto-Optimal LLM Deployment
Abstract
Existing quantization methods are fundamentally limited by rigid, integer-valued bit-widths (e.g., 2-bit or 3-bit), creating a "deployment gap" where LLMs cannot be optimally fitted to specific memory budgets. To bridge this gap, we introduce LiftQuant, a novel framework that enables continuous bit-width control for true Pareto-optimal deployment. The core innovation is a "lift-then-project" mechanism: we represent each d-dimensional weight vector as the projection of a point on a simple 1-bit lattice in a tunable D-dimensional "lifted" space. By adjusting the lifted dimension D, LiftQuant naturally yields an effective bit-width of D/d, allowing seamless, continuous resolution adjustment rather than discrete steps. The projection induces a structured yet non-uniform codebook, capturing the expressive power of vector quantization. Crucially, the decoding path relies solely on linear transformations and 1-bit uniform quantizers, retaining hardware-friendly efficiency. This flexibility is transformative: LiftQuant can compress a 70B LLM to 2.4 bits to precisely fit a 24GB GPU, where its performance significantly surpasses that of state-of-the-art 2-bit models. With decoding up to 6.7x faster than FP16, LiftQuant redefines compression as a continuous optimization problem, paving the way for a new generation of hardware-aware LLM deployment.
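To make the lift-then-project idea concrete, the following is a minimal NumPy sketch under illustrative assumptions (the projection matrix, the least-squares lifting step, and the per-group scale are our own simplifications, not the paper's actual encoder): each group of d weights is encoded as D sign bits in the lifted space, giving an effective D/d bits per weight, and decoding is a single linear transform of the 1-bit code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: groups of d=4 weights, lifted dimension D=10
# -> effective bit-width D/d = 2.5 bits per weight (plus a per-group scale).
d, D = 4, 10
P = rng.standard_normal((d, D)) / np.sqrt(D)   # fixed projection: lifted space -> weight space

def encode(w):
    """Lift w to R^D via a least-squares relaxation, then snap to the 1-bit lattice {-1,+1}^D."""
    z, *_ = np.linalg.lstsq(P, w, rcond=None)  # min-norm z with P @ z ~= w
    b = np.sign(z)                             # D bits: the 1-bit lattice point
    v = P @ b
    s = float(w @ v) / float(v @ v)            # optimal per-group scale
    return b, s

def decode(b, s):
    """Decoding is one linear transform of 1-bit codes -- the hardware-friendly path."""
    return s * (P @ b)

w = rng.standard_normal(d)
b, s = encode(w)
w_hat = decode(b, s)
print(D / d, np.linalg.norm(w - w_hat) / np.linalg.norm(w))
```

Varying D while holding d fixed is what gives the continuous bit-width dial: D=8 yields 2.0 bits per weight, D=10 yields 2.5, and so on, with no change to the decoding machinery.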