Proteus: Lookup-Free Trellis-Coded Quantization by Lattice-Breaking Compute Codes for 2-Bit LLMs
Abstract
Autoregressive decoding of large language models is frequently memory-traffic bound, so ultra-low-bit weight-only PTQ helps only if dequantization avoids irregular codebook or LUT access in the inner loop. Under the GPU-friendly bitshift trellis, existing 2-bit trellis-coded quantization (TCQ) pipelines either reintroduce micro-LUTs or suffer overlap-amplified artifacts, because incoherence improves global Gaussianity but does not guarantee overlap-local joint geometry. We introduce Proteus, a strictly lookup-free TCQ framework whose computed generator, MUL-BAL, uses cheap integer mixing plus a per-layer affine Gaussianizer to produce overlap-robust, near-Gaussian code values with zero runtime table loads. Proteus instantiates each layer by selecting from a small, pre-vetted candidate pool, then applies lightweight channel compensation and optional few-shot distillation that tune only per-layer affine statistics while keeping packed indices and the bitshift-trellis decoder fixed. On Llama 2 (7B–70B) at 2-bit PTQ, Proteus improves perplexity and zero-shot accuracy over strong TCQ/PTQ baselines and reduces end-to-end decode bandwidth at comparable throughput (e.g., 740 vs. 1020 GB/s on 70B).
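To make the "computed generator" idea concrete, the sketch below shows one way a lookup-free code generator could map a trellis state to a dequantized value using only integer mixing and a per-layer affine map, as the abstract describes. The mixing constants, the two-halves summing trick, and the function names (`mix16`, `decode_value`) are illustrative assumptions, not the paper's actual MUL-BAL recipe.

```python
# Hypothetical sketch of a lookup-free computed code generator: integer
# mixing of the trellis state followed by a per-layer affine map. All
# constants and structure are assumptions for illustration only.

def mix16(state: int) -> int:
    """Cheap integer mixing: multiply by an odd constant mod 2^16 and
    fold the high byte back in. Both steps are bijective on 16 bits."""
    x = (state * 0x2545) & 0xFFFF  # odd multiplier keeps the map invertible
    return x ^ (x >> 8)            # fold high bits for better diffusion

def decode_value(state: int, scale: float, shift: float) -> float:
    """Map a 16-bit trellis state to a weight with zero table loads:
    the mixed bits are split into two bytes whose sum is triangularly
    (roughly Gaussian-ish) distributed, then affinely rescaled per layer."""
    m = mix16(state)
    lo, hi = m & 0xFF, m >> 8
    u = (lo + hi) - 255            # centered sum of two ~uniform bytes
    return scale * u + shift       # per-layer affine "Gaussianizer"
```

Because every step is a few integer ops plus one fused multiply-add, such a generator keeps dequantization in registers, avoiding the irregular shared-memory LUT traffic the abstract argues against.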