Poster in Workshop: Next Generation of Sequence Modeling Architectures
BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts
Qizhen Zhang · Nikolas Gritsch · Dwaraknath Gnaneshwar · Simon Guo · David Cairuz · Bharat Venkitesh · Jakob Foerster · Phil Blunsom · Sebastian Ruder · Ahmet Üstün · Acyr Locatelli
Training Mixture of Experts (MoEs) from scratch in a large-scale regime is expensive. Previous work addresses this challenge by independently training multiple dense expert models and using them to initialize an MoE; in particular, the MoE layers are initialized from the experts' feed-forward parameters while all other parameters are merged. This limits the advantages of the specialized dense models when "upcycling" them into an MoE. We propose BAM (Branch-Attend-Mix), a simple yet effective improvement to MoE training. BAM makes full use of the specialized dense models by not only using their feed-forward network (FFN) weights to initialize the MoE layers but also leveraging the experts' attention weights, initializing them as Mixture of Attention (MoA) layers. Our experiments with seed models ranging from 590 million to 2 billion parameters show that our approach outperforms state-of-the-art approaches under the same data and compute budget in both perplexity and downstream task evaluations.
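A minimal sketch of the upcycling idea described above, in PyTorch: each specialized dense model contributes both its FFN and its attention weights as experts, while the remaining (non-expert) parameters are merged by averaging. The module names (`DenseBlock`, `BAMBlock`), the sequence-level top-1 router, and the parameter-averaging merge are illustrative assumptions, not the paper's exact architecture or routing scheme.

```python
# Illustrative sketch of BAM-style upcycling (assumptions: module names,
# sequence-level top-1 routing, and averaging as the merge rule).
import copy
import torch
import torch.nn as nn


class DenseBlock(nn.Module):
    """One transformer block of a specialized dense seed model (hypothetical)."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)


def _average_modules(modules: list[nn.Module]) -> nn.Module:
    """Merge identically-shaped modules by parameter averaging (for non-expert weights)."""
    merged = copy.deepcopy(modules[0])
    with torch.no_grad():
        for name, p in merged.named_parameters():
            p.copy_(torch.stack([dict(m.named_parameters())[name] for m in modules]).mean(0))
    return merged


class BAMBlock(nn.Module):
    """Upcycled block with both FFN experts and attention experts (Mixture of Attention)."""
    def __init__(self, dense_experts: list[DenseBlock], d_model: int):
        super().__init__()
        n = len(dense_experts)
        # Copy *all* expert parameters from the dense seeds, not just the FFNs.
        self.ffn_experts = nn.ModuleList(copy.deepcopy(e.ffn) for e in dense_experts)
        self.attn_experts = nn.ModuleList(copy.deepcopy(e.attn) for e in dense_experts)
        self.router = nn.Linear(d_model, n)  # shared router; top-1 for simplicity
        # Non-expert parameters (here, the layer norms) are merged by averaging.
        self.norm1 = _average_modules([e.norm1 for e in dense_experts])
        self.norm2 = _average_modules([e.norm2 for e in dense_experts])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Simplified routing: one attention expert and one FFN expert per sequence.
        scores = self.router(x.mean(dim=1))   # [batch, n_experts]
        top1 = scores.argmax(dim=-1)          # [batch]
        out = torch.empty_like(x)
        for i, expert_id in enumerate(top1.tolist()):
            h = self.norm1(x[i : i + 1])
            a, _ = self.attn_experts[expert_id](h, h, h)
            h = x[i : i + 1] + a
            out[i : i + 1] = h + self.ffn_experts[expert_id](self.norm2(h))
        return out


if __name__ == "__main__":
    d_model, n_heads, d_ff = 64, 4, 256
    dense_models = [DenseBlock(d_model, n_heads, d_ff) for _ in range(3)]  # 3 specialized seeds
    bam = BAMBlock(dense_models, d_model)
    x = torch.randn(2, 10, d_model)
    print(bam(x).shape)  # torch.Size([2, 10, 64])
```

The contrast with standard FFN-only upcycling is the `attn_experts` list: instead of averaging the seeds' attention weights into a single shared attention module, each seed's attention is kept as a routable expert.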