CoCoQuant: Breaking the Bandwidth Wall via Co-Optimized Communication and Computation Quantization
Abstract
The rapid scaling of large language models (LLMs) has made distributed inference indispensable, yet end-to-end latency is increasingly dominated by communication, creating a bandwidth wall that fundamentally limits the practical gains of existing quantization techniques. Prior approaches typically treat communication and computation in isolation, failing to exploit their coupled nature; as a result, they deliver limited system-level acceleration and suffer accuracy degradation. To address this, we propose CoCoQuant, a co-designed framework that jointly optimizes communication and computation as a unified end-to-end design space. CoCoQuant introduces a precision-aligned graph-rewriting pass that enables zero-overhead fusion between low-precision communication and computation, and it formulates a hardware-aware mixed-precision allocation problem that integrates roofline-based cost modeling with relative sensitivity calibration, solved via global integer linear programming. Extensive experiments on LLMs of varying scales demonstrate that CoCoQuant achieves Pareto-optimal accuracy-latency trade-offs, delivering up to 2.92× end-to-end speedup with a negligible perplexity increase (0.22).
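To make the allocation step concrete, the following is a minimal sketch of a global ILP for layer-wise bit-width selection in the spirit described above, written with PuLP. The layer names, latency and sensitivity numbers, and the error budget are illustrative assumptions, not the paper's actual cost model or calibration data.

```python
# Hypothetical sketch of the global ILP for mixed-precision allocation.
# Layer names, cost numbers, and the sensitivity budget below are
# illustrative assumptions, not CoCoQuant's actual calibration data.
import pulp

layers = [f"layer{i}" for i in range(4)]
bits = [4, 8, 16]

# Roofline-style latency estimate per (layer, bit-width): fewer bits move
# less data through both the collective and the matmul, so latency scales
# with precision in this toy model.
latency = {(l, b): (i + 1) * b * 0.1 for i, l in enumerate(layers) for b in bits}
# Relative sensitivity: the assumed accuracy penalty of running a layer
# at b bits; deeper layers are made more sensitive for illustration.
sensitivity = {(l, b): (i + 1) / b for i, l in enumerate(layers) for b in bits}
budget = 2.0  # total tolerated sensitivity-weighted error (assumed)

prob = pulp.LpProblem("mixed_precision_allocation", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", (layers, bits), cat="Binary")

# Objective: end-to-end latency predicted by the roofline cost model.
prob += pulp.lpSum(latency[l, b] * x[l][b] for l in layers for b in bits)
# Exactly one precision is assigned to each layer.
for l in layers:
    prob += pulp.lpSum(x[l][b] for b in bits) == 1
# Global accuracy constraint from the sensitivity calibration.
prob += pulp.lpSum(sensitivity[l, b] * x[l][b] for l in layers for b in bits) <= budget

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for l in layers:
    chosen = next(b for b in bits if x[l][b].value() > 0.5)
    print(f"{l}: {chosen}-bit")
```

Because the per-layer precision choices interact through the shared error budget, solving them jointly rather than greedily per layer is what makes the allocation globally optimal under the cost model.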