Don't Drop Dropout: Optimizing Layer Sparsity for Efficient LLM Training and Inference
Abstract
Layer dropout (a.k.a.\ stochastic depth) has been shown to enable faster training, higher accuracy, and robustness to zero-shot layer pruning in both language and vision transformers. However, as models and datasets have scaled, dropout, and layer dropout in particular, has largely disappeared from LLM pre-training recipes. While some prior work has reported that dropout can degrade accuracy, no comprehensive study has quantified, let alone mitigated, this effect. In this study, we show that layer dropout should be used in state-of-the-art LLM training, and we establish best practices and a scaling analysis for both its training-time and post-training benefits. Concretely, with an optimal per-layer dropout distribution, dropout schedule over training, and optimizer hyperparameters, a 3.9B-parameter LLM can achieve \textbf{lower validation loss} while saving 20\% of training FLOPs. Moreover, layer dropout enables significant post-training optimizations, such as early exit, intermediate-layer skipping, and self-speculative decoding, yielding up to a $1.7\times$ inference speedup with negligible accuracy loss. Across more than 2400 training experiments, spanning models from 271M to 3.9B parameters and datasets of up to 116B tokens, we demonstrate that these findings extend reliably to large-scale training regimes.