NITP: Next Implicit Token Prediction for LLM Pre-training
Abstract
Standard Next-Token Prediction (NTP) supervises language models solely through discrete labels in the output logit space. We argue that this sparse, one-hot supervision leaves the latent representation space under-constrained, allowing hidden states to drift into degenerate and anisotropic configurations that limit generalization. To address this issue, we propose Next Implicit Token Prediction (NITP), which augments discrete prediction with dense, continuous supervision directly in the representation space. NITP requires the model to predict the implicit semantic content of the next token, using shallow-layer representations from the same model as stable self-supervised targets. Theoretically, we show that NITP regularizes the optimization landscape by eliminating under-constrained degrees of freedom and enforcing a compact, structured representation geometry. Empirically, across dense and MoE models ranging from 0.5B to 9B parameters, NITP consistently improves downstream performance with negligible computational overhead. Notably, on the 9B MoE model, NITP achieves a 5.7% absolute improvement on MMLU-Pro, along with gains of 6.4% on C3 and 4.3% on CommonsenseQA, with ~2% additional training FLOPs and no additional inference cost.
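A minimal numpy sketch of the kind of objective the abstract describes. All names, the cosine-similarity form of the auxiliary term, and the weighting scheme are assumptions for illustration, not the paper's actual implementation: the standard next-token cross-entropy is combined with a dense representation-space loss that pulls a predicted next-token representation toward a (stop-gradient) shallow-layer target from the same model.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ntp_loss(logits, targets):
    # standard next-token cross-entropy: sparse, one-hot supervision
    # in the output logit space
    probs = softmax(logits)
    return -np.mean(np.log(probs[np.arange(len(targets)), targets]))

def implicit_loss(pred_repr, shallow_target):
    # dense supervision directly in representation space:
    # 1 - cosine similarity to a shallow-layer target representation
    # (the target would be treated as stop-gradient during training)
    p = pred_repr / np.linalg.norm(pred_repr, axis=-1, keepdims=True)
    t = shallow_target / np.linalg.norm(shallow_target, axis=-1, keepdims=True)
    return np.mean(1.0 - np.sum(p * t, axis=-1))

def nitp_loss(logits, targets, pred_repr, shallow_target, alpha=0.1):
    # hypothetical combined objective; alpha weights the auxiliary
    # representation-space term relative to cross-entropy
    return ntp_loss(logits, targets) + alpha * implicit_loss(pred_repr, shallow_target)
```

Because the auxiliary term only adds a small regression head over existing hidden states, it is consistent with the abstract's claim of negligible training overhead and zero inference cost (the head is dropped at inference).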