OmniMoE: An Efficient MoE by Orchestrating Atomic Experts at Scale
Jingze Shi ⋅ Zhangyang Peng ⋅ Yizhang Zhu ⋅ Yifan Wu ⋅ Guang Liu ⋅ Yuyu Luo
Abstract
Mixture-of-Experts (MoE) architectures are evolving towards finer granularity to improve parameter efficiency. However, existing MoE designs face an inherent trade-off between the granularity of expert specialization and hardware execution efficiency. In this paper, we propose OmniMoE, a system-algorithm co-designed MoE framework that pushes granularity to the extreme with vector-level Atomic Experts, orchestrating their routing and execution at scale within a single MoE layer while retaining a shared dense MLP for general-purpose processing. Although this atomic design maximizes capacity, it poses severe challenges for routing complexity and memory access. To address these challenges, OmniMoE introduces two co-designed components: (i) a Cartesian Product Router that decomposes the massive expert index space, reducing routing complexity from $O(N)$ to $O(\sqrt{N})$; and (ii) Expert-Centric Scheduling that inverts the execution order, turning scattered, memory-bound lookups into efficient dense matrix operations. Evaluated on seven zero-shot benchmarks, OmniMoE (with 1.7B active parameters) achieves 50.9\% average accuracy, outperforming both coarse-grained (e.g., DeepSeekMoE) and fine-grained (e.g., PEER) baselines. Crucially, OmniMoE reduces inference latency from 73ms to 6.7ms compared to PEER (a 10.9$\times$ speedup), demonstrating that massive-scale fine-grained MoE can be both fast and accurate.
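To make the $O(N) \to O(\sqrt{N})$ routing claim concrete, the sketch below illustrates a product-key-style decomposition: $N = n \times n$ atomic experts are indexed by pairs of sub-keys, each half of the routing query is scored against only $n = \sqrt{N}$ sub-keys, and the top candidates are combined via a Cartesian product. This is a minimal illustration under assumed shapes and names (e.g., `cartesian_product_route`, `subkeys_a`, `subkeys_b`), not the paper's exact implementation.

```python
import torch

def cartesian_product_route(query, subkeys_a, subkeys_b, top_k):
    """Select top_k of N = n * n atomic experts via two sub-key scorings.

    query:      (d,)     routing query, split into two halves of size d/2
    subkeys_a:  (n, d/2) sub-keys for the first index axis
    subkeys_b:  (n, d/2) sub-keys for the second index axis
    Each half is scored against n = sqrt(N) sub-keys, so the scoring cost
    is O(sqrt(N)) rather than O(N).
    """
    d = query.shape[-1]
    q_a, q_b = query[: d // 2], query[d // 2 :]

    # Score each query half against its own sub-key set: two (n,) score vectors.
    scores_a = subkeys_a @ q_a
    scores_b = subkeys_b @ q_b

    # Keep only the top_k candidates per axis before forming the product.
    top_a, idx_a = scores_a.topk(top_k)
    top_b, idx_b = scores_b.topk(top_k)

    # Cartesian product of the two candidate sets: (top_k, top_k) combined scores.
    combined = top_a[:, None] + top_b[None, :]
    flat_scores, flat_idx = combined.flatten().topk(top_k)

    # Recover the full expert index i * n + j for each selected pair (i, j).
    n = subkeys_b.shape[0]
    expert_ids = idx_a[flat_idx // top_k] * n + idx_b[flat_idx % top_k]
    gate = torch.softmax(flat_scores, dim=-1)
    return expert_ids, gate

# Example: one million atomic experts factorized as 1000 x 1000 sub-keys,
# routed with two 1000-way scorings instead of a single 1,000,000-way scoring.
q = torch.randn(256)
ka, kb = torch.randn(1000, 128), torch.randn(1000, 128)
ids, gate = cartesian_product_route(q, ka, kb, top_k=16)
```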