AdaHC: Accelerating Multi-Token Prediction with Adaptive Head Chunking under Pipeline Parallelism
Yan Wang ⋅ Chang Si ⋅ Kaiming Yang ⋅ Zhipeng Zhang ⋅ Weijian Liu ⋅ Man Yuan ⋅ Mingzhen Li ⋅ Yong Li ⋅ Weile Jia
Abstract
The multi-token prediction (MTP) architecture is widely adopted in large language models (LLMs): MTP blocks appended to the tail of a model predict additional tokens. However, when training with pipeline parallelism, MTP introduces extra pipeline bubbles and degrades pipeline efficiency. Based on an in-depth analysis of MTP architectures and loss functions, we identify the parallel nature of MTP blocks and leverage it for superior pipeline scheduling. We propose AdaHC, an adaptive pipeline scheduling framework for accelerating the training of LLMs with MTP block(s). AdaHC splits the output heads into chunks, reassembles the chunks into balanced pipeline stages, and performs adaptive activation forwarding to preserve numerical equivalence. Experimental results show that AdaHC improves the training throughput of state-of-the-art LLMs with diverse MTP configurations by 1.35$\times$ on average. This work paves a new direction for practical pipeline-parallel training.
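To make the head-chunking idea concrete, the following is a minimal Python sketch, not the authors' implementation: it assumes per-head costs, a chunk count, and a greedy longest-processing-time packing heuristic, all of which are illustrative assumptions rather than details from the paper.

```python
# Hypothetical sketch of head chunking + stage reassembly.
# Head costs, chunk counts, and the greedy heuristic are assumptions.

def chunk_heads(head_costs, num_chunks):
    """Split each MTP output head's cost into equal-cost chunks."""
    chunks = []
    for head_id, cost in enumerate(head_costs):
        per_chunk = cost / num_chunks
        chunks.extend((head_id, per_chunk) for _ in range(num_chunks))
    return chunks

def assemble_stages(chunks, num_stages):
    """Greedily pack chunks onto pipeline stages to balance per-stage
    cost (longest-processing-time heuristic): place each chunk, from
    heaviest to lightest, on the currently lightest stage."""
    stages = [[] for _ in range(num_stages)]
    loads = [0.0] * num_stages
    for head_id, cost in sorted(chunks, key=lambda c: -c[1]):
        s = loads.index(min(loads))  # lightest stage so far
        stages[s].append(head_id)
        loads[s] += cost
    return stages, loads

# Example: three MTP heads with unequal costs, split into 4 chunks
# each, packed onto 2 pipeline stages.
stages, loads = assemble_stages(chunk_heads([1.0, 0.6, 0.4], 4), 2)
print(stages, loads)  # per-stage loads come out near-equal (1.0 each)
```

Finer chunking gives the packer more freedom, so per-stage loads converge toward equality at the cost of more chunks to route; the actual trade-off and the adaptive activation forwarding are detailed in the paper itself.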