On Efficient Scaling of GNNs via I/O-Aware Layer Implementations
Daria Fomina ⋅ Daniil Krasylnikov ⋅ Alexey Boykov ⋅ Andrey Dolgovyazov ⋅ Vyacheslav Zhdanovskiy ⋅ Fedor Velikonivtsev
Abstract
Graph Neural Networks (GNNs) are bottlenecked by sparse, irregular memory access. Popular frameworks such as DGL and PyTorch Geometric support general message passing, but complex layers often materialize edge-wise intermediates, increasing memory traffic and limiting scalability on large graphs. We take an I/O- and arithmetic-intensity-centric view and show that widely used layers fall into three kernel families: SpMM-based convolutions, reduction-based aggregations, and attention-based layers (GATv2/Graph Transformer). For each family, we develop GPU kernels that reduce data movement, improve locality, and remain robust across realistic graphs. We also study graph reordering and find that its impact depends on the kernel mapping: it benefits neighbor-parallel (gather-dominated) kernels more consistently than feature-parallel designs. Empirically, our fused attention kernels reach up to **3.9**$\times$ speedup for Graph Transformer (median **1.6**$\times$), with Tensor Core (block-sparse) variants reaching up to **7.3**$\times$ on locally dense graphs; for GATv2 we reach up to **8.5**$\times$ speedup (median **2.0**$\times$) while reducing peak memory by up to **76**$\times$ (median **6**$\times$). Our degree-aware reduction kernels achieve up to **10**$\times$ speedup (median **2.6**$\times$). For SpMM-based layers, properly cached cuSPARSE achieves up to **8**$\times$ speedup over DGL and outperforms the custom baselines we evaluated in the majority of cases. We release our implementations as drop-in replacements to support reproducible, hardware-aware GNN acceleration.
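As a hedged illustration of the "properly cached cuSPARSE" idea mentioned above, the sketch below shows one plausible way to reuse cuSPARSE descriptors and the SpMM workspace across forward passes instead of recreating them on every call; the wrapper name `CachedSpMM`, its methods, and its fixed feature width are assumptions made for exposition and are not the paper's released implementation.

```cpp
// Sketch: cache cuSPARSE descriptors and the SpMM workspace across calls.
// Names and structure are illustrative assumptions, not the paper's API.
#include <cusparse.h>
#include <cuda_runtime.h>

struct CachedSpMM {
    cusparseHandle_t handle = nullptr;
    cusparseSpMatDescr_t A = nullptr;   // graph adjacency in CSR, fixed across calls
    cusparseDnMatDescr_t X = nullptr;   // input node features (row-major)
    cusparseDnMatDescr_t Y = nullptr;   // output node features (row-major)
    void* workspace = nullptr;          // SpMM scratch buffer, allocated once
    size_t workspaceBytes = 0;

    // Build descriptors and the workspace once; subsequent calls skip this cost.
    void init(int n, int f, int nnz, int* rowPtr, int* colInd, float* vals,
              float* xData, float* yData) {
        cusparseCreate(&handle);
        cusparseCreateCsr(&A, n, n, nnz, rowPtr, colInd, vals,
                          CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                          CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
        cusparseCreateDnMat(&X, n, f, f, xData, CUDA_R_32F, CUSPARSE_ORDER_ROW);
        cusparseCreateDnMat(&Y, n, f, f, yData, CUDA_R_32F, CUSPARSE_ORDER_ROW);
        float alpha = 1.f, beta = 0.f;
        cusparseSpMM_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                                CUSPARSE_OPERATION_NON_TRANSPOSE,
                                &alpha, A, X, &beta, Y, CUDA_R_32F,
                                CUSPARSE_SPMM_ALG_DEFAULT, &workspaceBytes);
        cudaMalloc(&workspace, workspaceBytes);
    }

    // Per forward pass (same shapes): only the feature pointers change;
    // descriptors and the workspace buffer are reused.
    void run(float* xData, float* yData) {
        cusparseDnMatSetValues(X, xData);
        cusparseDnMatSetValues(Y, yData);
        float alpha = 1.f, beta = 0.f;
        cusparseSpMM(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                     CUSPARSE_OPERATION_NON_TRANSPOSE,
                     &alpha, A, X, &beta, Y, CUDA_R_32F,
                     CUSPARSE_SPMM_ALG_DEFAULT, workspace);
    }
};
```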