TileQ: Efficient Low-Rank Quantization of Mixture-of-Experts with 2D Tiling
Abstract
Mixture-of-Experts (MoE) models achieve remarkable performance by sparsely activating specialized experts, yet the massive number of parameters in their experts poses significant challenges for deployment. While low-rank quantization offers a promising route to compressing MoE models, existing methods still incur non-negligible memory overhead and inference latency. To address these limitations, we propose TileQ, a fine-tuning-free post-training quantization (PTQ) method that employs 2D-tiling structured low-rank quantization to share low-rank factors across both the input and output dimensions of MoE experts. Furthermore, we introduce an efficient inference technique for TileQ that fuses multiple low-rank expert computations into a single-pass operation, significantly improving hardware utilization. Experiments show that TileQ reduces the additional memory usage by up to 10x and cuts the additional inference latency to 5%, while preserving state-of-the-art accuracy.
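The sketch below is a minimal NumPy illustration of one plausible reading of the abstract's "2D-tiling structured low-rank quantization": the quantization residual of a single expert weight matrix is approximated tile by tile as R[i, j] ≈ U_i @ V_j, so that each tile-row shares a left factor along the output dimension and each tile-column shares a right factor along the input dimension. The quantizer, tile size, rank, the alternating least-squares fit, and all function names (fake_quantize_int4, fit_tiled_shared_lowrank) are illustrative assumptions, not TileQ's actual algorithm or fused kernel.

```python
import numpy as np

def fake_quantize_int4(w):
    """Per-output-channel symmetric 4-bit quantization (returns the dequantized view)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    return np.clip(np.round(w / scale), -8, 7) * scale

def fit_tiled_shared_lowrank(residual, tile=64, rank=4, iters=5):
    """Fit shared factors {U_i}, {V_j} so that every tile (i, j) of the residual
    satisfies residual[i*tile:(i+1)*tile, j*tile:(j+1)*tile] ~= U_i @ V_j.
    Factors are shared across tile-rows (output dim) and tile-columns (input dim),
    so storage grows with rows + cols rather than rows * cols."""
    d_out, d_in = residual.shape
    rows, cols = d_out // tile, d_in // tile
    rng = np.random.default_rng(0)
    V = rng.standard_normal((cols, rank, tile)) * 0.01   # right factors, one per tile-column
    U = np.zeros((rows, tile, rank))                     # left factors, one per tile-row
    for _ in range(iters):
        # Update each row factor U_i against all tiles in its tile-row (least squares).
        Vcat = np.concatenate(list(V), axis=1)           # (rank, cols*tile)
        for i in range(rows):
            Rrow = residual[i*tile:(i+1)*tile, :]        # (tile, cols*tile)
            U[i] = Rrow @ np.linalg.pinv(Vcat)
        # Update each column factor V_j against all tiles in its tile-column.
        Ucat = np.concatenate(list(U), axis=0)           # (rows*tile, rank)
        Ucat_pinv = np.linalg.pinv(Ucat)
        for j in range(cols):
            Rcol = residual[:, j*tile:(j+1)*tile]        # (rows*tile, tile)
            V[j] = Ucat_pinv @ Rcol
    return U, V

# Toy expert weight: quantize, then compensate the residual with the shared tiled factors.
w = np.random.default_rng(1).standard_normal((256, 512)).astype(np.float32)
w_q = fake_quantize_int4(w)
U, V = fit_tiled_shared_lowrank(w - w_q, tile=64, rank=4)
```

Under this reading, the memory saving comes from amortizing each low-rank factor over an entire tile-row or tile-column instead of storing a separate factor pair per tile or per expert; the single-pass fused inference mentioned in the abstract is not modeled here.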