SubspacePath Pruner: Inference-time Pruning via Probe-based Representation–Parameter Coupling
Abstract
Large-scale deployment of LLMs across diverse application scenarios increasingly demands specialized inference behavior under strict accuracy, latency, and memory constraints. However, the heterogeneous, long-tailed nature of real-world specialized scenarios makes training data difficult to obtain and models difficult to optimize. We study a practical inference-time specialization setting: given a base LLM, we compile a reusable, budget-bounded pathway (subnetwork) for a specific scenario. Our approach is motivated by an empirical coupling phenomenon: input sets that align with similar representation subspaces (e.g., domains) in embedding space tend to activate a consistent, sparse set of internal reasoning pathways in parameter space. To bridge the two spaces, we propose the probe-based SubspacePath Pruner with two core components: (1) Domain-Basis Synthesis (DBS) constructs a quasi-orthogonal basis of domain axes in embedding space, serving as a stable coordinate system; (2) Probe-based Scenario Pruning (PSP) uses efficient layer-wise linear probes to estimate axis alignment and to compute budgeted head-wise pathways for a given scenario. On LLaMA-2-13B, our method achieves 29.3 average Recall on cross-domain tests (vs. 24.7 dense) and 21.6 on cross-dataset tests (vs. 25.5 dense), with a 1.27x speedup at a ~30% pruning ratio.
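The pipeline described in the abstract can be illustrated with a minimal sketch. All shapes, the random stand-ins for hidden states and head scores, and the simple projection "probe" below are hypothetical placeholders, not the paper's actual DBS/PSP implementation; the sketch only shows the flow from a quasi-orthogonal domain basis, through layer-wise axis-alignment estimates, to a budgeted head-wise keep mask.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: D domain axes in a d-dim embedding space
d, D, n_layers, n_heads = 64, 4, 8, 12

# DBS stand-in: a quasi-orthogonal basis of domain axes (orthonormal columns via QR)
domain_basis = np.linalg.qr(rng.normal(size=(d, D)))[0]   # (d, D)

# Layer-wise pooled hidden states for one scenario's calibration inputs (random stand-in)
hidden = rng.normal(size=(n_layers, 32, d))               # (layer, sample, dim)

# PSP stand-in: a linear "probe" as projection onto the domain axes, averaged over samples
axis_align = np.abs(hidden @ domain_basis).mean(axis=1)   # (layer, D) alignment estimates

# Hypothetical head importance, coupled to each layer's strongest axis alignment
head_scores = rng.random((n_layers, n_heads)) * axis_align.max(axis=1, keepdims=True)

# Budgeted head-wise pathway: keep the top 70% of heads per layer (~30% pruning ratio)
budget = 0.7
k = int(np.ceil(budget * n_heads))
keep = np.argsort(-head_scores, axis=1)[:, :k]            # indices of retained heads
mask = np.zeros((n_layers, n_heads), dtype=bool)
np.put_along_axis(mask, keep, True, axis=1)               # reusable pathway mask

print(mask.sum(axis=1))  # each layer retains exactly k heads
```

At inference time, such a precomputed mask would be applied to disable the pruned attention heads for inputs from the matching scenario; the actual alignment estimation and head scoring in PSP are learned, not random.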