Deep Fusion: Efficient Network Training via Pre-trained Initializations

Hanna Mazzawi · Xavi Gonzalvo · Michael Wunder · Sammy Jerome · Benoit Dherin

Hall C 4-9
Wed 24 Jul 4:30 a.m. PDT — 6 a.m. PDT


Training deep neural networks for large language models (LLMs) remains computationally expensive. Network growing algorithms offer potential cost savings, but their underlying mechanisms are poorly understood. In this paper, we propose a theoretical framework that uses backward error analysis to illuminate the dynamics of mid-training network growth. We also introduce Deep Fusion, an efficient network training approach that leverages pre-trained initializations of smaller networks and so allows a network to be grown from diverse sources. Our experiments validate the power of our theoretical framework in guiding the optimal use of Deep Fusion. With carefully optimized training dynamics, Deep Fusion significantly reduces both training time and resource consumption without sacrificing performance. We demonstrate reduced computational requirements and improved generalization performance on a variety of NLP tasks and T5 model sizes.
