ConceptMoE: Adaptive Token-to-Concept Compression for Implicit Compute Allocation
Zihao Huang ⋅ Jundong Zhou ⋅ Xingwei Qu ⋅ Qiyang Min ⋅ Ge Zhang
Abstract
Large language models allocate uniform computation across all tokens, ignoring that some sequences are trivially predictable while others require deep reasoning. We introduce ConceptMoE, which dynamically merges semantically similar tokens into concepts through learnable chunking at a target compression ratio $R$. The MoE architecture enables controlled evaluation: reallocating the saved computation to match baseline FLOPs and parameters isolates genuine architectural benefits. ConceptMoE consistently outperforms standard MoE, achieving +0.9 points on language pretraining, +2.3 on long context, and +0.6 on multimodal tasks. Continual-training conversion with layer looping gains +5.5 points. Beyond performance, at $R=2$, ConceptMoE reduces attention computation by $R^2\times$ and KV cache by $R\times$, delivering prefill speedups of up to 175\% and decoding speedups of up to 117\%. The minimal architectural changes enable straightforward integration, demonstrating that adaptive concept-level processing improves both LLM effectiveness and efficiency.
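To make the efficiency argument concrete, the sketch below illustrates why compressing tokens into concepts at ratio $R$ cuts attention compute by roughly $R^2\times$ and the KV cache by $R\times$. It is a minimal stand-in, not the paper's method: ConceptMoE uses learnable chunking to decide where semantically similar tokens are merged, whereas this sketch mean-pools fixed, uniform chunks of size $R$; the function name and tensor shapes are illustrative assumptions.

```python
import torch

def compress_tokens_fixed_ratio(hidden: torch.Tensor, ratio: int) -> torch.Tensor:
    """Merge every `ratio` consecutive token embeddings into one concept embedding.

    hidden: [batch, seq_len, dim] token representations.
    Returns: [batch, seq_len // ratio, dim] concept representations.

    NOTE: fixed uniform chunking is a simplification. ConceptMoE instead learns
    chunk boundaries so that semantically similar tokens are merged, targeting
    an average compression ratio R rather than a hard block size.
    """
    b, t, d = hidden.shape
    t_trunc = (t // ratio) * ratio                       # drop any ragged tail for simplicity
    chunks = hidden[:, :t_trunc].view(b, t_trunc // ratio, ratio, d)
    return chunks.mean(dim=2)                            # mean-pool tokens within each chunk

# With R = 2, a 4096-token sequence becomes 2048 concepts, so attention cost
# scales as (T/R)^2 = T^2 / 4 and the KV cache shrinks to T/2 entries per layer.
x = torch.randn(1, 4096, 1024)
concepts = compress_tokens_fixed_ratio(x, ratio=2)
print(concepts.shape)  # torch.Size([1, 2048, 1024])
```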