ImpQuant: Fine-Grained Importance-Aware Quantization for Large Vision-Language Models
Abstract
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across diverse multimodal tasks, yet their high inference costs necessitate low-bit deployment. Existing post-training quantization (PTQ) pipelines primarily adopt methodologies from text-only LLMs, treating multimodal inputs as homogeneous sequences and overlooking the heterogeneous information density inherent in LVLMs. In this work, we present ImpQuant, an importance-aware PTQ framework tailored for LVLMs that mitigates low-bit accuracy degradation via fine-grained token-importance reweighted calibration and outlier-aware activation quantization. Our key insight is that quantization errors on decision-critical tokens disproportionately impact overall model behavior. Accordingly, we reweight the calibration loss using aggregated attention for textual tokens and a contextual redundancy metric for visual tokens. Across multiple LVLM backbones and diverse multimodal benchmarks, our approach consistently improves accuracy at low bitwidths and reduces quantization-induced object hallucinations compared to state-of-the-art PTQ baselines.
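The core idea of importance-reweighted calibration can be sketched as follows. This is a minimal illustrative implementation, not the paper's actual pipeline: the aggregation of attention (here, a simple column sum of an averaged attention matrix), the redundancy score for visual tokens, and the per-token weighted MSE objective are all simplifying assumptions.

```python
import numpy as np

def token_importance(attn, visual_mask, redundancy):
    """Compute normalized per-token importance weights.

    attn: (L, L) attention matrix, assumed pre-averaged over heads/layers.
    visual_mask: (L,) bool array, True for visual tokens.
    redundancy: (L,) hypothetical contextual-redundancy score in [0, 1]
                for visual tokens (higher = more redundant).
    """
    # Aggregated attention: total attention each token receives.
    recv_attn = attn.sum(axis=0)
    # Visual tokens: down-weight by redundancy; text tokens: attention only.
    imp = np.where(visual_mask, (1.0 - redundancy) * recv_attn, recv_attn)
    return imp / imp.sum()

def reweighted_calib_loss(fp_out, q_out, weights):
    """Importance-weighted MSE between full-precision and quantized outputs.

    fp_out, q_out: (L, D) per-token hidden states; weights: (L,) importance.
    """
    per_token_err = ((fp_out - q_out) ** 2).mean(axis=-1)
    return float((weights * per_token_err).sum())
```

Under this sketch, quantization error on a token that receives heavy attention (or a non-redundant visual token) contributes proportionally more to the calibration objective, which is the mechanism the abstract describes for protecting decision-critical tokens.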