Poster Mon, Jul 6, 2026 • 6:30 PM – 8:15 PM PDT HALL A #4203

RQ-MoE: Residual Quantization via Mixture of Experts for Efficient Input-Dependent Vector Compression

Zhengjia Zhong ⋅ Shuyan Ke ⋅ Zaizhou Lin ⋅ Jiaqi Song ⋅ Hongyi Lan ⋅ Hui Li

Abstract

Vector quantization is a fundamental tool for compressing high-dimensional embeddings, yet existing multi-codebook methods rely on static codebooks, which limits their expressiveness under heterogeneous data geometry. Recent dynamic quantizers such as QINCo adapt codebooks to individual inputs and improve expressiveness, but their strict sequential dependencies create decoding bottlenecks. We propose Residual Quantization via Mixture of Experts (RQ-MoE), a framework that combines a two-level MoE with dual-stream quantization to adapt codebooks to each input. RQ-MoE constructs codebooks dynamically and decouples instruction from quantization, which allows decoding to run in parallel. Theoretically, we show that standard Residual Quantization and QINCo can be recovered as constrained special cases of RQ-MoE, and we derive a guideline for setting expert dimensionality in RQ-MoE. Extensive experiments show that RQ-MoE achieves state-of-the-art or on-par performance in reconstruction and retrieval, and that it can decode 6×–14× faster than prior vector quantization methods. The implementation is available at https://github.com/KDEGroup/RQ-MoE.
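
For readers unfamiliar with the baseline, the following is a minimal sketch of standard Residual Quantization with static codebooks, the setting the abstract describes as a constrained special case. It is not the RQ-MoE method and is not taken from the authors' repository. The function names (`nearest`, `rq_encode`, `rq_decode`) and the hand-written toy codebooks are assumptions chosen for illustration; real codebooks would be learned from data.

```python
def nearest(codebook, v):
    """Index of the codeword closest to v in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((c - x) ** 2 for c, x in zip(codebook[k], v)),
    )


def rq_encode(x, codebooks):
    """Greedy residual quantization: each stage quantizes what the
    previous stages left unexplained."""
    residual = list(x)
    codes = []
    for cb in codebooks:
        k = nearest(cb, residual)
        codes.append(k)
        residual = [r - c for r, c in zip(residual, cb[k])]
    return codes


def rq_decode(codes, codebooks):
    """Reconstruction is the sum of the selected codewords, one per stage."""
    out = [0.0] * len(codebooks[0][0])
    for k, cb in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, cb[k])]
    return out


# Hypothetical toy codebooks: a coarse stage followed by a finer stage.
codebooks = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [0.25, 0.25]],
]

codes = rq_encode([1.2, 0.3], codebooks)
recon = rq_decode(codes, codebooks)
print(codes, recon)  # [1, 3] [1.25, 0.25]
```

With static codebooks, each decoding stage is a plain table lookup that does not depend on the other stages, so the lookups can be summed in any order. When a stage's codebook depends on the partial reconstruction from earlier stages, as the abstract describes for dynamic quantizers such as QINCo, the stages must be decoded one after another.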

Lay Summary

Modern AI systems often represent images, text, or users as long numerical vectors. To store and search these vectors efficiently, systems usually compress them into small sets of discrete IDs. However, existing compression methods often struggle to accurately describe the complex structure of real-world data, while more adaptive methods can become slow because reconstruction must happen step by step. We developed a new method that converts vectors into discrete IDs in a more flexible and efficient way. Instead of using the same fixed compression pattern for every vector, our approach dynamically adjusts to the local structure of the data. We also redesigned the reconstruction process so different stages can run independently, avoiding the delays caused by sequential processing. Our experiments show that this method preserves information more accurately after compression while also improving efficiency. This could help large-scale AI systems such as search engines, recommendation systems, and generative AI tools store and process massive amounts of vector data more effectively.