DELTA4: Sparse Matrix-Vector Multiplication for Low Sparsity
Vladimír Macko ⋅ Vladimír Boža
Abstract
Sparse Matrix-Vector Multiplication (SpMV) is a fundamental operation in the inference of sparse Large Language Models (LLMs). Because existing SpMV methods perform poorly under the low, unstructured sparsity ($30\text{–}90\%$) commonly observed in pruned LLMs, unstructured pruning provides only limited memory reduction and speedup. We propose **DELTA4-SpMV**, a GPU-optimized format and kernel co-designed to reduce storage overhead while remaining compatible with the GPU's execution model. This enables efficient SpMV for unstructured sparsity without specialized hardware units or precomputation. We identify memory bandwidth as the primary limiting factor of SpMV and analyze the storage overhead of DELTA4. At $50\%$ sparsity, DELTA4 is the first approach to achieve $1.5\times$ memory reduction and a $1.2\text{–}1.5\times$ speedup over the dense baseline, as well as substantial improvements over other SpMV methods: cuSPARSE ($2.8\text{–}13.0\times$), Sputnik ($1.9\text{–}2.6\times$), and DASP ($2.2\text{–}2.5\times$). An LLM pruned with Wanda to $50\%$ sparsity requires $1.5\times$ less memory and achieves $1.5\times$ faster inference at fp16 precision. As a result, **unstructured pruning at $50\%$ sparsity becomes practical** for real-world LLM workloads and **bridges the efficiency gap with structured 2:4 sparsity**.
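The abstract does not detail the DELTA4 format itself, but the SpMV operation it accelerates can be illustrated with the standard CSR (compressed sparse row) layout: per-nonzero index storage is exactly the overhead that dominates at low sparsity. A minimal NumPy sketch of baseline CSR SpMV (illustrative background only, not the DELTA4 kernel):

```python
import numpy as np

def csr_spmv(indptr, indices, data, x):
    """Compute y = A @ x where A is stored in CSR form.

    indptr[row]..indptr[row+1] delimits row `row`'s nonzeros;
    `indices` holds their column positions, `data` their values.
    """
    y = np.zeros(len(indptr) - 1, dtype=x.dtype)
    for row in range(len(y)):
        start, end = indptr[row], indptr[row + 1]
        # Accumulate only the stored (nonzero) entries of this row.
        y[row] = np.dot(data[start:end], x[indices[start:end]])
    return y

# Dense equivalent of a 3x3 matrix at roughly 50% sparsity:
# [[1, 0, 2],
#  [0, 3, 0],
#  [4, 0, 5]]
indptr  = np.array([0, 2, 3, 5])
indices = np.array([0, 2, 1, 0, 2])
data    = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x = np.ones(3)
print(csr_spmv(indptr, indices, data, x))  # [3. 3. 9.]
```

Note that at fp16 precision a 32-bit column index per nonzero triples the per-element footprint, so at $50\%$ sparsity plain CSR stores roughly $0.5 \times 3 = 1.5\times$ the dense matrix's bytes; compact index encodings are what make the claimed $1.5\times$ memory reduction possible.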