Efficient Bilevel Optimization for CKA-Guided MoE Upcycling
Abstract
Upcycling, a strategy that initializes a Mixture-of-Experts (MoE) model by replicating pre-trained feed-forward or MoE networks to expand model capacity, has become a popular method in continual learning due to its effectiveness in mitigating catastrophic forgetting. However, existing paradigms rely on indiscriminate expansion that prioritizes performance at the cost of extreme inefficiency, introducing parameter redundancy without exploiting the structural heterogeneity essential for counteracting forgetting with architectural economy. To address this, we investigate the determinants of forgetting in training dynamics, using Centered Kernel Alignment (CKA) and loss-landscape flatness to analyze the behavior of pre- and post-expansion MoE layers. This analysis uncovers instability in deep-layer representations and heterogeneous expert sensitivity to new tasks, demonstrating the potential of selective upcycling to eliminate redundancy. Consequently, we propose a dynamic bilevel optimization framework to guide adaptive upcycling: an outer loop employs a Gumbel-Softmax differentiable mask to perform Neural Architecture Search (NAS) for adaptive growth, while an inner loop optimizes weight updates via task objectives and CKA-regularized replay. Experiments on the TRACE benchmark demonstrate that our method achieves better average accuracy with an 80\% reduction in forgetting, while eliminating 60\% of the redundant parameter expansion that standard upcycling would introduce.
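For reference, the CKA similarity invoked above is assumed here to be the standard HSIC-normalized formulation (the abstract does not specify a variant): given representation matrices $X$ and $Y$ with Gram matrices $K = XX^\top$ and $L = YY^\top$,
\[
\mathrm{CKA}(K, L) \;=\; \frac{\mathrm{HSIC}(K, L)}{\sqrt{\mathrm{HSIC}(K, K)\,\mathrm{HSIC}(L, L)}},
\]
where $\mathrm{HSIC}$ denotes the Hilbert-Schmidt Independence Criterion. Under this assumption, the CKA-regularized replay in the inner loop would presumably penalize decreases in this similarity between pre- and post-update layer representations.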