DAG-MoE: From Simple Mixture to Structural Aggregation in Mixture-of-Experts
Abstract
Mixture-of-Experts (MoE) models have become a leading approach for decoupling parameter count from computational cost in large language models. Despite significant progress, effectively scaling MoE performance remains challenging. Prior work shows that fine-grained experts enlarge the space of expert combinations and can improve flexibility, but they also impose substantial routing overhead, creating a new scalability bottleneck. In this paper, we explore a complementary scaling axis: how expert outputs are mixed. We first analyze the limitations of the standard weighted-summation aggregation used in conventional MoE architectures. We then show theoretically that introducing structural aggregation expands the expert-combination space without altering the experts or the router configuration, and enables multi-step reasoning within a single MoE layer. Building on this analysis, we propose DAG-MoE, a sparse MoE framework that employs a lightweight module to automatically learn the optimal aggregation structure among the selected experts. We evaluate DAG-MoE under standard language-modeling settings. Extensive experiments show that DAG-MoE consistently improves performance in both pretraining and fine-tuning, surpassing traditional MoE baselines.
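To make the distinction concrete, the sketch below contrasts the standard flat weighted sum with a DAG-structured aggregation over the selected experts. The experts, gate weights, and DAG edge weights here are illustrative placeholders (in DAG-MoE the structure is learned by a lightweight module, whose form the abstract does not specify), so this is a minimal sketch of the idea rather than the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 3  # hidden size, number of selected experts

# Toy "experts": each is a single linear map (real experts are FFNs).
experts = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(k)]
gates = np.array([0.5, 0.3, 0.2])  # router weights for the selected experts
x = rng.standard_normal(d)

# Standard MoE: every selected expert sees the same input x,
# and the outputs are combined by a flat weighted sum.
flat_out = sum(g * (W @ x) for g, W in zip(gates, experts))

# Structural (DAG) aggregation sketch: experts are visited in a
# topological order, and each expert's input mixes the layer input
# with the outputs of its DAG predecessors. The lower-triangular
# edge weights below are hypothetical, hand-set for illustration.
edge = np.array([[0.0, 0.0, 0.0],
                 [0.7, 0.0, 0.0],   # expert 1 reads expert 0's output
                 [0.2, 0.5, 0.0]])  # expert 2 reads experts 0 and 1
outs = []
for i, W in enumerate(experts):
    inp = x + sum(edge[i, j] * outs[j] for j in range(i))
    outs.append(W @ inp)
dag_out = sum(g * o for g, o in zip(gates, outs))

print(flat_out.shape, dag_out.shape)
```

Because later experts compose the outputs of earlier ones, a single forward pass through the layer applies a chain of expert transformations, which is the sense in which structural aggregation permits multi-step computation inside one MoE layer; with all edge weights zero it reduces exactly to the standard weighted sum.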