Thoughtbubbles: an Unsupervised Method for Parallel Thinking in Latent Space
Abstract
Current approaches for scaling inference-time compute in transformers train them to emit explicit chain-of-thought tokens before producing an answer. While these methods are powerful, they are limited because they cannot be applied during pretraining and rely solely on serially-generated, natural-language verbalization. In this work, we propose Thoughtbubbles, a transformer variant that natively performs parallel adaptive computation in latent space by learning to fork or delete residual streams. Thus, tokens requiring more computation can form a "bubble" of cloned residuals in the middle of the network. Crucially, this behavior is learned during pretraining with only language modeling loss. Using half of the training budget, Thoughtbubbles outperforms the perplexity and zero-shot evals of both standard decoder LMs and those using non-adaptive parallel computation approaches. These results hold across model sizes from 150M to 1.9B. Thoughtbubbles achieves competitive GSM8K results using half of the baseline's token budget. The implicit nature of our method enables models to begin learning adaptive computation at pretraining time, paving the way to unified train-time and test-time scaling behaviors.