Bringing Future Vision: Dynamic Computation Allocation Guided by a Lightweight Feature Forecaster
Abstract
The deployment of large language models (LLMs) in real-world applications is increasingly limited by their high inference cost. While recent advances in dynamic token-level computation allocation attempt to improve efficiency by selectively activating model components per token, existing methods rely on greedy routing, a myopic execute-or-skip mechanism that often leads to irreversible information loss and suboptimal token selection. This paper introduces informed routing, a new paradigm that proactively addresses these issues. The key insight is to assess not only a token's immediate importance but also its recoverability, i.e., how well its transformation can be approximated. To this end, we propose the Lightweight Feature Forecaster (LFF), a small predictive module that estimates a unit's output before routing decisions are made. This enables a flexible execute-or-approximate policy that preserves model fidelity while drastically reducing computation. Extensive experiments show that informed routing consistently outperforms both static and dynamic pruning approaches. We further present two practical inference pipelines, a pure PyTorch implementation and a Triton-based custom operator, that translate these gains into real-world speedups, delivering consistent acceleration across a range of batch sizes.
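To make the execute-or-approximate policy concrete, the following is a minimal, dependency-free sketch of the routing idea, not the paper's actual method: `heavy_block` stands in for an expensive transformer unit, `forecaster` for a cheap LFF-style surrogate, and `informed_route` uses a simple recoverability proxy (here, token magnitude, since the toy linear forecaster is only accurate near zero) to decide which tokens receive the full computation. All names and the proxy heuristic are illustrative assumptions.

```python
import math

def heavy_block(x):
    # Stand-in for an expensive per-token transformation (toy, hypothetical).
    # Residual-style nonlinear update: y = tanh(3x) + x.
    return [math.tanh(3.0 * v) + v for v in x]

def forecaster(x):
    # Lightweight surrogate of heavy_block (toy stand-in for the LFF).
    # A linear map that is accurate only for small |x|.
    return [1.5 * v for v in x]

def informed_route(tokens, budget):
    # Rank tokens by a cheap recoverability proxy: tokens with large |x|
    # fall in the nonlinear regime where the linear forecast is poor, so
    # they get the full block; the rest are approximated by the forecaster.
    order = sorted(range(len(tokens)), key=lambda i: abs(tokens[i]), reverse=True)
    execute = set(order[:budget])
    return [heavy_block([t])[0] if i in execute else forecaster([t])[0]
            for i, t in enumerate(tokens)]
```

Unlike greedy execute-or-skip routing, skipped tokens are not zeroed out: they still receive an approximate update, so information is degraded gracefully rather than lost outright.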