Skip to yearly menu bar Skip to main content


Efficient Training of BERT by Progressively Stacking

Linyuan Gong · Di He · Zhuohan Li · Tao Qin · Liwei Wang · Tie-Yan Liu

Pacific Ballroom #50

Keywords: [ Unsupervised and Semi-supervised Learning ] [ Natural Language Processing ] [ Deep Sequence Models ]


Unsupervised pre-training is popularly used in natural language processing. By designing proper unsupervised prediction tasks, a deep neural network can be trained and shown to be effective in many downstream tasks. As the data is usually adequate, the model for pre-training is generally huge and contains millions of parameters. Therefore, the training efficiency becomes a critical issue even when using high-performance hardware. In this paper, we explore an efficient training method for the state-of-the-art bidirectional Transformer (BERT) model. By visualizing the self-attention distribution of different layers at different positions in a well-trained BERT model, we find that in most layers, the self-attention distribution will concentrate locally around its position and the start-of-sentence token. Motivating from this, we propose the stacking algorithm to transfer knowledge from a shallow model to a deep model; then we apply stacking progressively to accelerate BERT training. The experimental results showed that the models trained by our training strategy achieve similar performance to models trained from scratch, but our algorithm is much faster.

Live content is unavailable. Log in and register to view live content