Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control
Abstract
Training a family of large language models (LLMs), either from scratch or via iterative compression, is prohibitively expensive and inefficient, requiring a separate training run for each model in the family. In this paper, we introduce Star Elastic, a novel LLM post-training method that adds N nested submodels to a given parent reasoning model in a single post-training job, using the compute of one run (an Nx saving). Beyond reducing training costs, Star Elastic also addresses a fundamental limitation of efficient reasoning: the rigidity of static architectures, which forces constant resource allocation regardless of token difficulty. By unlocking elastic budget control, Star Elastic enables a novel inference strategy that uses a different submodel for each reasoning phase (thinking and answering). Star Elastic supports (1) nesting along the SSM, embedding channel, MoE, and FFN axes, (2) learning nested submodels via an end-to-end trainable router, and (3) curriculum-based knowledge distillation. We apply Star Elastic to the NVIDIA Nemotron Nano models; in particular, we demonstrate its effectiveness on hybrid MoE architectures with Nemotron Nano v3 (30B/3.6A), generating 23B (2.8A) and 12B (2.0A) variants using 160B training tokens. For Nemotron Nano v2 (12B), we produce 9B and 6B nested models using only 110B training tokens, a 360x reduction in training cost compared with training from scratch and a 7x reduction compared with state-of-the-art compression methods. All nested models match or outperform independently trained baselines of comparable size. Crucially, elastic budget control advances the accuracy-latency Pareto frontier, achieving up to 16% higher accuracy and 1.9x lower latency via dynamic per-phase model selection.
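To make the per-phase budget control described above concrete, the sketch below illustrates one plausible way a nested parent model could be driven at inference time: a different submodel is selected for the thinking phase and the answering phase. This is a minimal illustration, not the paper's implementation; the names PhaseBudget, generate_with_budget, activate_submodel, and the submodel labels are hypothetical assumptions, and the reuse of cached state across phases is likewise an assumption.

```python
# Minimal sketch (assumed API, not Star Elastic's actual interface) of elastic
# per-phase budget control: pick one nested submodel for the "thinking" phase
# and another for the "answering" phase of a reasoning model.

from dataclasses import dataclass


@dataclass
class PhaseBudget:
    thinking: str   # e.g. "12B" (a small nested submodel) -- illustrative label
    answering: str  # e.g. "30B" (the full parent model) -- illustrative label


def generate_with_budget(model, prompt: str, budget: PhaseBudget,
                         max_tokens: int = 2048) -> str:
    """Generate a response using different nested submodels per reasoning phase."""
    # Phase 1: produce the reasoning trace with the submodel chosen for thinking.
    model.activate_submodel(budget.thinking)   # assumed method: routes to nested weights
    thinking = model.generate(prompt, stop="</think>", max_tokens=max_tokens)

    # Phase 2: produce the final answer with the submodel chosen for answering.
    # Because submodels share the parent's weights, cached state from phase 1
    # could in principle be reused instead of re-encoding the trace (assumption).
    model.activate_submodel(budget.answering)
    answer = model.generate(prompt + thinking, max_tokens=max_tokens)
    return thinking + answer


# Example usage (hypothetical): spend less compute on the exploratory trace
# and more on the final answer.
# budget = PhaseBudget(thinking="12B", answering="30B")
# output = generate_with_budget(parent_model, "Prove that ...", budget)
```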