SparseInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation
Qinsi Wang ⋅ Saeed Vahidian ⋅ Hancheng Ye ⋅ Jianyang Gu ⋅ Jianyi Zhang ⋅ Yiran Chen
Abstract
Large Language Models (LLMs) with billions of parameters have transformed AI applications but require immense computational and memory resources during inference. Adaptive sparse activation inference, which activates only a small subset of neurons for each token, offers a novel way to accelerate model inference without degrading performance, showing great potential for resource-constrained hardware devices. However, existing methods predict activated neurons at the token level with additional MLP predictors, causing frequent changes to the activation map that limit the achievable acceleration. In this paper, we introduce \textbf{SparseInfer}, an MLP-free adaptive sparse activation inference method based on sentence-level prediction. We first propose the concept of core neurons and empirically demonstrate that, for a given input sentence, LLMs need only these core neurons to maintain performance. Remarkably, we find that core neurons exhibit both stability and similarity with respect to the sentence's semantics, an insight overlooked by previous studies. Building on this finding, we design two semantics-based methods for predicting core neurons to fit different input scenarios, which allows the core neurons to be determined during the pre-filling stage and remain fixed during the encoding stage. Our experiments verify that SparseInfer performs well across various tasks and achieves a 10.33$\times$ speedup.
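To make the sentence-level idea concrete, below is a minimal PyTorch sketch of one FFN block that scores neurons over the whole prompt during pre-filling, keeps a fixed subset of "core" neurons, and reuses only that subset at each generation step. The class name, the `core_ratio` parameter, and the scoring rule (mean absolute activation) are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn


class CoreNeuronFFN(nn.Module):
    """Toy FFN block illustrating sentence-level core-neuron sparse inference."""

    def __init__(self, d_model: int, d_ff: int, core_ratio: float = 0.2):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)
        self.act = nn.ReLU()
        self.k = max(1, int(core_ratio * d_ff))  # number of core neurons to keep
        self.core_idx = None                      # fixed after pre-filling

    def prefill(self, x: torch.Tensor) -> torch.Tensor:
        """Full forward over the prompt; also selects core neurons for this sentence."""
        h = self.act(self.up(x))                      # (seq_len, d_ff) activations
        scores = h.abs().mean(dim=0)                  # sentence-level neuron importance (assumed rule)
        self.core_idx = scores.topk(self.k).indices  # fix the core neurons here
        return self.down(h)

    def decode_step(self, x: torch.Tensor) -> torch.Tensor:
        """Sparse forward for one new token using only the fixed core neurons."""
        idx = self.core_idx
        h = self.act(x @ self.up.weight[idx].T + self.up.bias[idx])   # (1, k)
        return h @ self.down.weight[:, idx].T + self.down.bias        # (1, d_model)


if __name__ == "__main__":
    ffn = CoreNeuronFFN(d_model=64, d_ff=256, core_ratio=0.2)
    prompt_hidden = torch.randn(10, 64)   # hidden states of a 10-token prompt
    _ = ffn.prefill(prompt_hidden)        # core neurons are determined once here
    new_token_hidden = torch.randn(1, 64)
    out = ffn.decode_step(new_token_hidden)
    print(out.shape)  # torch.Size([1, 64])
```

Because the core-neuron set is chosen once per sentence rather than per token, the sparse mask never changes during generation, which is the property the abstract credits for avoiding the overhead of token-level MLP predictors.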