RaBiT: Residual-Aware Binarization Training for Accurate and Efficient LLMs
Youngcheon You ⋅ Banseok Lee ⋅ Minseop Choi ⋅ Seonyoung Kim ⋅ Hyochan Chong ⋅ Changdong Kim ⋅ Youngmin Kim ⋅ Dongkyu Kim
Abstract
Efficient deployment of large language models (LLMs) requires extreme quantization, forcing a critical trade-off between low-bit efficiency and performance. Residual binarization promises hardware-friendly, matmul-free inference by stacking binary ($\pm$1) layers, but is plagued by pathological feature co-adaptation. We identify a key failure mode, which we term inter-path adaptation: during Quantization-Aware Training (QAT), parallel residual binary paths learn redundant features, degrading the error-compensation structure and crippling the model's expressive capacity. While prior work relies on heuristic workarounds (e.g., path freezing) that limit model capacity, we propose RaBiT, a novel quantization framework that resolves co-adaptation by algorithmically enforcing a residual hierarchy. Its core mechanism sequentially derives each binary path from a single shared full-precision weight, ensuring each path corrects its predecessor's error. This process is stabilized by a robust initialization that prioritizes functional preservation over mere weight approximation. RaBiT redefines the 2-bit accuracy-efficiency frontier: it achieves state-of-the-art performance, rivals even hardware-intensive Vector Quantization (VQ) methods, and delivers a 4.49$\times$ inference speed-up over full-precision models on an RTX 4090.
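To make the residual hierarchy concrete, the sketch below shows greedy residual binarization in PyTorch: each binary path fits the residual left by its predecessors, so a 2-bit weight is a sum of two scaled sign matrices. The function `residual_binarize` and the two-path setup are illustrative assumptions, not RaBiT's actual training procedure.

```python
import torch

def residual_binarize(w: torch.Tensor, num_paths: int = 2):
    """Greedy residual binarization: W ~= sum_k alpha_k * B_k, where
    each binary path B_k corrects the error left by its predecessors.

    Hypothetical helper illustrating the residual hierarchy described
    in the abstract; RaBiT's full QAT procedure is not reproduced here.
    """
    residual = w.clone()
    paths = []
    for _ in range(num_paths):
        b = torch.sign(residual)          # binary (+/-1) codes
        b[b == 0] = 1                     # sign() maps 0 -> 0; force +/-1
        alpha = residual.abs().mean()     # closed-form L2-optimal scale
        paths.append((alpha, b))
        residual = residual - alpha * b   # next path fits this error
    return paths, residual

# Example: a two-path (2-bit) approximation of a weight matrix.
w = torch.randn(4096, 4096)
paths, err = residual_binarize(w, num_paths=2)
w_hat = sum(alpha * b for alpha, b in paths)
print(f"relative error: {(err.norm() / w.norm()).item():.3f}")
```

With `B = sign(R)` fixed, `alpha = mean(|R|)` minimizes the L2 reconstruction error of the residual `R`, which is why each successive path strictly reduces the remaining error under this greedy scheme.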