Poster Mon, Jul 6, 2026 • 10:00 PM – 11:45 PM PDT HALL A #1908

Compute Where it Counts: Self Optimizing Language Models

Yash Akhauri ⋅ Mohamed Abdelfattah

Abstract

Efficient LLM inference research has largely focused on reducing the cost of each decoding step (e.g., using quantization, pruning, or sparse attention), typically applying a uniform computation budget to every generated token. In practice, token difficulty varies widely, so static compression can over-compute on easy steps and under-compute on hard ones. We study dynamic budget allocation for autoregressive decoding: learning how much computation to spend per token from within a single model. Self-Optimizing Language Models (SOL) pair a frozen LLM with a lightweight policy network that reads the LLM hidden state and selects a discrete efficiency action at each decode step. Actions can jointly control (i) token-level attention sparsity, (ii) structured activation pruning in the MLP, and (iii) activation quantization bit-width, while leaving the base model weights unchanged. We train the policy with group-relative policy optimization on teacher-forced episodes: the token sequence is fixed, while we sample multiple compute schedules (i.e., “counterfactual” schedules that vary only the efficiency actions for the same token path) and compare their likelihoods under the same supervision. Our reward trades off language-model quality against soft penalties that encourage episode-average budget usage to match a requested target. Across model variants and compute regimes, SOL improves quality at matched budget over static allocation and strong random schedule search, offering a complementary axis for inference-efficiency optimization. SOL discovers a better quality-efficiency Pareto front across all our experiments and improves MMLU accuracy by up to 7.3% over uniform budget allocation strategies. Code is available at https://github.com/akhauriyash/SOL
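
The training signal described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the absolute-value form of the budget penalty, the `penalty_weight` parameter, the function names, and the toy numbers are all assumptions made for illustration. The abstract states only that the reward combines language-model quality with soft budget penalties, and that schedules sampled for the same token sequence are compared within a group.

```python
from statistics import mean, pstdev

def reward(token_logprobs, budgets, target_budget, penalty_weight=1.0):
    """Quality minus a soft penalty on episode-average budget mismatch.

    Assumed form: the abstract only says "soft penalties"; the absolute
    deviation used here is one simple choice.
    """
    quality = mean(token_logprobs)   # mean log-likelihood of the fixed tokens
    avg_budget = mean(budgets)       # episode-average compute fraction
    return quality - penalty_weight * abs(avg_budget - target_budget)

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled schedules."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group: three compute schedules for the same 4-token sequence.
# Per-token budgets are fractions of full compute; log-probs are made up.
group = [
    {"logprobs": [-0.2, -1.5, -0.3, -2.0], "budgets": [0.5, 0.5, 0.5, 0.5]},      # uniform
    {"logprobs": [-0.3, -1.0, -0.4, -1.2], "budgets": [0.25, 0.75, 0.25, 0.75]},  # adaptive
    {"logprobs": [-0.2, -0.9, -0.3, -1.1], "budgets": [1.0, 1.0, 1.0, 1.0]},      # over budget
]

rewards = [reward(g["logprobs"], g["budgets"], target_budget=0.5) for g in group]
advantages = group_relative_advantages(rewards)
best = max(range(len(group)), key=lambda i: advantages[i])
```

In this toy group the adaptive schedule meets the target average budget while spending more on the two harder tokens, so it receives the largest advantage. The third schedule has the best likelihood but is penalized for exceeding the requested budget.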

Lay Summary

Large language models are powerful but expensive because they usually spend the same computation on every word they generate. Yet some words are easy to predict, while others need more context or precision. We introduce Self-Optimizing Language Models (SOL), which add a small controller to a frozen language model. At each generation step, the controller decides how much computation to use, such as how much context to attend to, how much pruning to apply, or what activation precision to use. SOL improves quality at the same compute budget compared with fixed efficiency strategies, finding better accuracy–efficiency trade-offs across model sizes and tasks. It improves MMLU accuracy by up to 7.3% over uniform budget allocation. Our code is available at https://github.com/akhauriyash/SOL