Poster Mon, Jul 6, 2026 • 10:00 PM – 11:45 PM PDT HALL A #2015

Reasoning Cache: Learning to Extrapolate to Long Lengths via Short-Length RL

Ian Wu ⋅ Yuxiao Qu ⋅ Amrith Setlur ⋅ Aviral Kumar

Abstract

Large language models (LLMs) that continue improving at test-time budgets far beyond their training budgets can solve harder problems by leveraging additional inference compute; we refer to this property as extrapolation. Standard on-policy RL operates on fixed problem distributions and training budgets, which creates a distribution shift between train and test that limits the resulting model's extrapolation capabilities. To address this, we introduce Reasoning Cache (RC), an iterative decoding algorithm that replaces standard autoregressive decoding and enables models to extrapolate to lengths an order of magnitude longer than those seen during training. RC exploits the asymmetry between the summarization and generation capabilities of LLMs to construct a decoding process that improves consistently over iterations. Its effectiveness can be further increased through training, which amplifies the model’s ability to perform summary-conditioned reasoning while avoiding the challenges of long-horizon RL. Empirically, training a 4B instruction-following model with RC using a 16k-token training budget improves performance on HMMT 2025 from 40% to 70% when evaluated with a 512k-token test budget, substantially surpassing comparably sized LLMs.
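
The abstract does not spell out the RC procedure, so the following is only a minimal sketch of what an iterative, summary-conditioned decoding loop could look like. It assumes that each iteration reasons under a fixed per-step budget and then compresses the result into a summary that conditions the next iteration. The names `rc_decode_sketch`, `generate`, `summarize`, `step_budget`, and `total_budget` are hypothetical, and the toy functions stand in for model calls.

```python
from typing import Callable, List, Tuple

# Hypothetical signatures (assumptions, not taken from the abstract):
#   generate(problem, summary, max_tokens) -> reasoning text
#   summarize(problem, previous_summary, reasoning) -> new summary
GenerateFn = Callable[[str, str, int], str]
SummarizeFn = Callable[[str, str, str], str]


def rc_decode_sketch(
    problem: str,
    generate: GenerateFn,
    summarize: SummarizeFn,
    step_budget: int,
    total_budget: int,
) -> Tuple[str, List[str]]:
    """Alternate short reasoning steps with summarization.

    Assumption: every iteration is charged the full step_budget, so the
    loop runs total_budget // step_budget times.
    """
    if step_budget <= 0:
        raise ValueError("step_budget must be positive")

    summary = ""                 # nothing is cached before the first step
    summaries: List[str] = []    # one summary per completed iteration
    used = 0

    while used + step_budget <= total_budget:
        # Reason for at most step_budget tokens, conditioned on the summary.
        reasoning = generate(problem, summary, step_budget)
        # Compress the previous summary and the new reasoning into a new summary.
        summary = summarize(problem, summary, reasoning)
        summaries.append(summary)
        used += step_budget

    return summary, summaries


# Toy stand-ins for model calls, for illustration only.
def toy_generate(problem: str, summary: str, max_tokens: int) -> str:
    return f"reasoning built on {len(summary)} cached notes"


def toy_summarize(problem: str, summary: str, reasoning: str) -> str:
    return summary + "x"  # each iteration adds one "note"


final_summary, history = rc_decode_sketch(
    "toy problem", toy_generate, toy_summarize, step_budget=16, total_budget=64
)
```

Under this sketch's accounting, the abstract's figures (a 16k-token training budget and a 512k-token test budget) would correspond to 512k / 16k = 32 iterations if every step used the full training budget; that mapping is an assumption of the sketch, not a claim made in the abstract.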