Prism-MoE: Efficient Dense-to-MoE Conversion for Visual Autoregressive Generation
Abstract
Scaling up visual autoregressive models improves generation quality but incurs substantial inference costs. Mixture-of-Experts (MoE) architectures mitigate this cost through sparse activation and have proven effective in large language models. However, training MoE models from scratch remains prohibitively expensive, and dense-to-MoE conversion for visual autoregressive models is still underexplored. To enable low-cost, high-quality conversion, we propose Prism-MoE, an efficient framework for transforming pretrained dense visual autoregressive models into sparse MoE models. Prism-MoE consists of two key components. First, we introduce trajectory-consistent initialization, which formulates expert initialization as a principled decomposition problem and preserves the generation trajectory of the pretrained model. Second, we propose a confidence-adaptive sparse fine-tuning framework that aligns expert specialization with the information density of visual tokens via confidence-aware routing supervision. Experiments show that Prism-MoE achieves dense-to-MoE conversion with less than 10\% of the standard training budget, while maintaining generation quality comparable to dense baselines with only 37.5\% active parameters.
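To make the dense-to-MoE conversion concrete, the following is a minimal, illustrative PyTorch sketch of turning a single pretrained dense FFN block into a sparsely activated MoE block. It is not the paper's method: the experts here are simply initialized as copies of the dense FFN (a stand-in for the trajectory-consistent decomposition, whose details are not given in the abstract), and the confidence-aware routing supervision is omitted. The class name MoEFFN, the expert count, and top-k value are all hypothetical choices for illustration.

# Minimal dense-to-MoE conversion sketch (assumption-based, not the Prism-MoE algorithm).
import copy
import torch
import torch.nn as nn


class MoEFFN(nn.Module):
    def __init__(self, dense_ffn: nn.Module, hidden_dim: int,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        # Each expert starts as a copy of the pretrained dense FFN, so the
        # converted layer initially reproduces the dense model's outputs.
        self.experts = nn.ModuleList(
            [copy.deepcopy(dense_ffn) for _ in range(num_experts)]
        )
        # A learned router scores each token against every expert.
        self.router = nn.Linear(hidden_dim, num_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_dim)
        logits = self.router(x)                          # (T, E) routing scores
        weights, idx = logits.topk(self.top_k, dim=-1)   # keep top-k experts per token
        weights = weights.softmax(dim=-1)                # normalize gating weights
        out = torch.zeros_like(x)
        # Dispatch each token only to its selected experts (sparse activation).
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * self.experts[e](x[mask])
        return out


# Usage: wrap a pretrained dense FFN, then fine-tune the router and experts.
dense = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
moe = MoEFFN(dense, hidden_dim=512, num_experts=8, top_k=2)
tokens = torch.randn(16, 512)
print(moe(tokens).shape)  # torch.Size([16, 512])

With top_k = 2 of 8 experts, only a fraction of the expert parameters is active per token, which is the mechanism behind the reduced active-parameter count the abstract reports; the copy-based initialization used here is only a placeholder for the paper's trajectory-preserving decomposition.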