WBMM: Windowed Batch Matrix Multiplication for Efficient Large Receptive Field Convolution
Wan Song ⋅ Zhou Wei ⋅ Rui Wang ⋅ Jun-KUT Yu ⋅ Toru Kurihara ⋅ Xu Jiajia ⋅ Shu Zhan
Abstract
Large kernel depthwise convolutions achieve strong performance but suffer significant throughput degradation as the kernel size grows, because gather-based computation incurs irregular memory access. While Large Kernel Acceleration (LKA) helps on small feature maps, it becomes \textbf{counterproductive on large feature maps}, running even slower than non-accelerated implementations. We propose Windowed Batch Matrix Multiplication (WBMM), which \emph{partitions} the input into contiguous windows and \emph{indexes} a compact relative position bias table to construct weight matrices, enabling regular memory access via batched matrix multiplication. This yields a unique property: \textbf{WBMM's throughput improves with larger windows}, the opposite of depthwise convolutions, which degrade with larger kernels. Operator-level benchmarks show that WBMM with $14 \times 14$ windows \textbf{outperforms $5 \times 5$ depthwise convolution baselines in speed} while providing a $7.8\times$ larger receptive field, and, combined with inter-block cross-window communication and hierarchical window reparameterization, it achieves comparable or higher accuracy on ImageNet-1K, COCO, and ADE20K with a 1.31--1.88$\times$ training speedup. WBMM also demonstrates consistent advantages across diverse hardware platforms, including GPU, CPU, and edge devices, without requiring specialized acceleration kernels. Code and models will be publicly available.
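To make the mechanism concrete, the sketch below illustrates one plausible reading of the operator described in the abstract: the input is partitioned into contiguous, non-overlapping windows, a compact relative position bias table is indexed to build a per-channel $w^2 \times w^2$ weight matrix, and each window is processed with a batched matrix multiplication. The names (`wbmm`, `relative_index`), the per-channel table layout, and the non-overlapping windowing are assumptions made for illustration, not the authors' reference implementation.

```python
# Illustrative sketch of windowed batch matrix multiplication (WBMM).
# Assumptions (not from the paper's code): non-overlapping square windows and a
# per-channel weight table indexed by relative position, as in the abstract.
import torch


def relative_index(window: int) -> torch.Tensor:
    """Map every (query, key) position pair inside a window to an index into a
    compact (2*window - 1)**2 relative-position table."""
    coords = torch.stack(torch.meshgrid(
        torch.arange(window), torch.arange(window), indexing="ij"))  # (2, w, w)
    coords = coords.flatten(1)                                       # (2, w*w)
    rel = coords[:, :, None] - coords[:, None, :]                    # (2, w*w, w*w)
    rel = rel.permute(1, 2, 0) + (window - 1)                        # shift to >= 0
    return rel[..., 0] * (2 * window - 1) + rel[..., 1]              # (w*w, w*w)


def wbmm(x: torch.Tensor, table: torch.Tensor, window: int) -> torch.Tensor:
    """x: (B, C, H, W) with H and W divisible by `window`.
    table: (C, (2*window - 1)**2) compact per-channel weight table."""
    B, C, H, W = x.shape
    idx = relative_index(window).to(x.device)          # (w*w, w*w)
    weight = table[:, idx]                             # (C, w*w, w*w) via indexing
    # Partition into contiguous windows -> regular, coalesced memory access.
    xw = x.view(B, C, H // window, window, W // window, window)
    xw = xw.permute(0, 2, 4, 1, 3, 5).reshape(-1, C, window * window)  # (B*nW, C, w*w)
    # One batched matmul per (window, channel) pair.
    out = torch.einsum("cpq,ncq->ncp", weight, xw)
    out = out.reshape(B, H // window, W // window, C, window, window)
    return out.permute(0, 3, 1, 4, 2, 5).reshape(B, C, H, W)


# Toy usage: a 14x14 window on a 28x28 feature map with 8 channels.
x = torch.randn(2, 8, 28, 28)
table = torch.randn(8, (2 * 14 - 1) ** 2)
print(wbmm(x, table, window=14).shape)  # torch.Size([2, 8, 28, 28])
```

Under this reading, the weight matrix is gathered once from the small table rather than gathering input pixels per output location, so the inner loop reduces to a dense batched matmul whose arithmetic intensity grows with the window size.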