Large Language Models Explore by Latent Distilling
Abstract
Generating diverse responses is crucial for test-time scaling of large language models (LLMs), yet standard stochastic sampling mostly yields surface-level lexical variation, limiting semantic exploration and risking omission of correct solutions. In this paper, we propose Exploratory Sampling (ES), a decoding approach that explicitly encourages semantic diversity during generation. ES is motivated by the observation that neural networks tend to make more accurate predictions on inputs similar to those encountered before, and incur higher prediction error on novel ones. Building on this property, we train a lightweight Distiller at test time to predict deep-layer hidden representations of the LLM from its shallow-layer representations. During decoding, the Distiller continuously adapts to the mappings induced by the current generation context. ES uses the prediction error as a novelty signal to reweight candidate token extensions conditioned on the current prefix, thereby biasing decoding toward less-explored semantic patterns. ES is implemented with an asynchronous training–inference pipeline and introduces less than 5% throughput overhead in standard serving scenarios. Empirical results show that ES achieves robust generalization across mathematics, science, and code generation benchmarks. Notably, it breaks the trade-off between diversity and coherence in creative writing, and significantly boosts the Pass@k efficiency of reasoning models, showing superior or comparable performance to strong stochastic and heuristic baselines.
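The novelty-based reweighting step can be illustrated with a minimal sketch. The `Distiller` here is a hypothetical linear probe standing in for the paper's lightweight predictor, and `exploratory_reweight` shows one plausible way to turn per-candidate prediction error into a sampling distribution; the function names, the linear form of the probe, and the additive `alpha`-scaled bonus are all assumptions for illustration, not the authors' exact formulation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

class Distiller:
    """Hypothetical stand-in for the lightweight Distiller: a fixed linear
    map from shallow-layer to deep-layer hidden states. In the paper this
    module is trained online during decoding."""
    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(dim, dim))

    def predict(self, shallow):
        # (k, dim) shallow states -> (k, dim) predicted deep states
        return shallow @ self.W

def exploratory_reweight(logits, shallow_states, deep_states, distiller, alpha=1.0):
    """Reweight candidate-token scores by the Distiller's prediction error.

    logits          : (k,) model scores for k candidate extensions
    shallow_states  : (k, dim) shallow-layer hidden states per candidate
    deep_states     : (k, dim) actual deep-layer hidden states per candidate
    alpha           : strength of the novelty bonus (an assumed knob)

    Higher prediction error = the mapping is less familiar to the Distiller,
    so the candidate is treated as more novel and its probability is boosted.
    """
    err = ((distiller.predict(shallow_states) - deep_states) ** 2).mean(axis=1)
    return softmax(logits + alpha * err)

# Toy usage: two candidates with equal base logits; candidate 1's deep
# state deviates from the Distiller's prediction, so it gets upweighted.
dim = 4
distiller = Distiller(dim)
shallow = np.ones((2, dim))
well_predicted = distiller.predict(shallow[0])
deep = np.stack([well_predicted, well_predicted + 1.0])  # candidate 1 is "novel"
probs = exploratory_reweight(np.zeros(2), shallow, deep, distiller)
```

In this toy, candidate 0's deep state matches the Distiller's prediction exactly (zero error), while candidate 1 carries a constant offset, so the reweighted distribution shifts mass toward candidate 1, the less-explored direction.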