WBMM: Windowed Batch Matrix Multiplication for Efficient Large Receptive Field Convolution
Wan Song ⋅ Zhou Wei ⋅ Rui Wang ⋅ Jun-KUT Yu ⋅ Toru Kurihara ⋅ Xu Jiajia ⋅ Shu Zhan
Abstract
Large kernel depthwise convolutions achieve strong performance but suffer significant throughput degradation as the kernel size grows, because gather-based computation incurs irregular memory access. While Large Kernel Acceleration (LKA) helps on small feature maps, it becomes \textbf{counterproductive on large feature maps}, running even slower than non-accelerated implementations. We propose Windowed Batch Matrix Multiplication (WBMM), which \emph{partitions} the input into contiguous windows and \emph{indexes} a compact relative position bias table to construct weight matrices, enabling regular memory access via batched matrix multiplication. This yields a unique property: \textbf{WBMM's throughput improves with larger windows}, the opposite of depthwise convolutions, which degrade with larger kernels. Operator-level benchmarks show that WBMM with $14 \times 14$ windows \textbf{outperforms $5 \times 5$ depthwise convolution baselines in speed} while providing a $7.8\times$ larger receptive field, and, combined with inter-block cross-window communication and hierarchical window reparameterization, it achieves comparable or higher accuracy on ImageNet-1K, COCO, and ADE20K with a 1.31--1.88$\times$ training speedup. WBMM also demonstrates consistent advantages across diverse hardware platforms, including GPU, CPU, and edge devices, without requiring specialized acceleration kernels. Code and models will be publicly available.
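To make the mechanism concrete, the sketch below illustrates one plausible reading of the operator described in the abstract: the input is partitioned into contiguous, non-overlapping windows, a compact relative position bias table is indexed to build a per-channel $w^2 \times w^2$ weight matrix, and each window is processed with a batched matrix multiplication. The names (`wbmm`, `relative_index`), the per-channel table layout, and the non-overlapping windowing are assumptions made for illustration, not the authors' reference implementation.

```python
# Illustrative sketch of windowed batch matrix multiplication (WBMM).
# Assumptions (not from the paper's code): non-overlapping square windows and a
# per-channel weight table indexed by relative position, as in the abstract.
import torch


def relative_index(window: int) -> torch.Tensor:
    """Map every (query, key) position pair inside a window to an index into a
    compact (2*window - 1)**2 relative-position table."""
    coords = torch.stack(torch.meshgrid(
        torch.arange(window), torch.arange(window), indexing="ij"))  # (2, w, w)
    coords = coords.flatten(1)                                       # (2, w*w)
    rel = coords[:, :, None] - coords[:, None, :]                    # (2, w*w, w*w)
    rel = rel.permute(1, 2, 0) + (window - 1)                        # shift to >= 0
    return rel[..., 0] * (2 * window - 1) + rel[..., 1]              # (w*w, w*w)


def wbmm(x: torch.Tensor, table: torch.Tensor, window: int) -> torch.Tensor:
    """x: (B, C, H, W) with H and W divisible by `window`.
    table: (C, (2*window - 1)**2) compact per-channel weight table."""
    B, C, H, W = x.shape
    idx = relative_index(window).to(x.device)          # (w*w, w*w)
    weight = table[:, idx]                             # (C, w*w, w*w) via indexing
    # Partition into contiguous windows -> regular, coalesced memory access.
    xw = x.view(B, C, H // window, window, W // window, window)
    xw = xw.permute(0, 2, 4, 1, 3, 5).reshape(-1, C, window * window)  # (B*nW, C, w*w)
    # One batched matmul per (window, channel) pair.
    out = torch.einsum("cpq,ncq->ncp", weight, xw)
    out = out.reshape(B, H // window, W // window, C, window, window)
    return out.permute(0, 3, 1, 4, 2, 5).reshape(B, C, H, W)


# Toy usage: a 14x14 window on a 28x28 feature map with 8 channels.
x = torch.randn(2, 8, 28, 28)
table = torch.randn(8, (2 * 14 - 1) ** 2)
print(wbmm(x, table, window=14).shape)  # torch.Size([2, 8, 28, 28])
```

Under this reading, the weight matrix is gathered once from the small table rather than gathering input pixels per output location, so the inner loop reduces to a dense batched matmul whose arithmetic intensity grows with the window size.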