Escaping the Subspace Trap: The Role of Optimizer Geometry in Model Width Expansion
Abstract
Pre-training large language models from scratch is prohibitively expensive as model scale increases. A practical alternative is Model Width Expansion (MWE), which grows a larger model from a well-pretrained "seed" model so that it inherits existing capabilities at initialization. However, we identify a phenomenon we term the Subspace Trap: during continual pre-training, parameter updates remain largely confined to a low-dimensional subspace aligned with the initialization, limiting the effective capacity of the expanded model. Our theoretical analysis attributes this phenomenon to the function-preserving properties of width expansion. In particular, element-wise adaptive optimizers remain confined to the trap, whereas optimizers that yield an isotropic geometry of parameter updates can escape it. To demonstrate the impact of the subspace trap on model performance, we conduct empirical experiments across model sizes and model families, which show that escaping the trap is consistently effective in improving training efficiency and overall model performance. Detailed mechanistic analyses further confirm that escaping the trap activates the newly added dimensions to encode general knowledge. Our code is available at https://anonymous.4open.science/r/MWE-1B46.