SlideSparse: Fast and Flexible (2N-2):2N Structured Sparsity
Yingbo HAO ⋅ Hanyong Shao ⋅ Ting Song ⋅ Yan Xia ⋅ Di Zhang ⋅ Shaohan Huang ⋅ Xun Wu ⋅ Songchen Xu ⋅ Le Xu ⋅ Li Dong ⋅ Zewen Chi ⋅ Yi Zou ⋅ Furu Wei
Abstract
NVIDIA's 2:4 Sparse Tensor Cores deliver $2\times$ throughput but demand 50% pruning—a ratio that collapses LLM reasoning accuracy (Qwen3: 54%→15%). Milder $(2N-2):2N$ patterns (e.g., 6:8, 25% pruning) preserve accuracy yet receive *no* hardware support, falling back to dense execution. We present **SlideSparse**, the first system to unlock Sparse Tensor Core acceleration for the $(2N-2):2N$ sparsity family on commodity GPUs. Our *Sliding Window Decomposition* losslessly rewrites any $(2N-2):2N$ block into $N-1$ overlapping 2:4-compliant windows; *activation lifting* fuses the corresponding activation rearrangement into per-token quantization at near-zero cost. Integrated into vLLM, **SlideSparse** is evaluated across five GPUs (A100, H100, B200, RTX 4090, RTX 5080), three precisions (INT8, FP8, BF16), and the Llama/Qwen/BitNet model families. On compute-bound workloads, speedup approaches the theoretical $N/(N-1)$ limit—Qwen2.5-7B with 6:8 sparsity achieves $1.33\times$, matching the bound exactly—establishing $(2N-2):2N$ as a practical path to accuracy-preserving LLM acceleration.
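
To make the quoted numbers concrete, the sketch below (ours, not from the paper; the function names are illustrative) works through the counting argument implied by the abstract: a $(2N-2):2N$ block spans $2N$ elements at dense cost $2N$, while $N-1$ overlapping 2:4 windows executed at $2\times$ Sparse Tensor Core throughput cost the equivalent of $2(N-1)$, giving the $N/(N-1)$ bound and the 25% pruning ratio for 6:8.

```python
# Minimal sketch (assumption: not the paper's code) of the counting argument
# behind the N/(N-1) speedup bound and the (2N-2):2N pruning ratio.

def pruning_ratio(n: int) -> float:
    """Fraction of weights removed by a (2N-2):2N pattern: 2 zeros per 2N elements."""
    return 2 / (2 * n)  # e.g., N=4 (6:8) -> 0.25


def theoretical_speedup(n: int) -> float:
    """Upper bound when a (2N-2):2N block runs as N-1 overlapping 2:4 windows.

    Dense cost of a 2N-wide block      : 2N MAC slots.
    Each 2:4 window at 2x throughput   : equivalent of 2 dense slots.
    N-1 windows are needed             : 2(N-1) slots total.
    """
    return (2 * n) / (2 * (n - 1))  # = N / (N-1)


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        pattern = f"{2 * n - 2}:{2 * n}"
        print(f"{pattern:>5}  prune={pruning_ratio(n):.0%}  bound={theoretical_speedup(n):.2f}x")
    # 6:8 (N=4): prune=25%, bound=1.33x -- consistent with the Qwen2.5-7B figure above.
```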