Rethinking Attention in Spiking Transformers: Overcoming Density Bias with Set Similarity
Abstract
Recent Spiking Transformer models have explored a variety of attention mechanisms beyond standard dot-product formulations. However, many existing similarity-based spiking attention formulations remain inherently sensitive to firing density, causing neurons with high spike rates to dominate attention scores regardless of semantic relevance. This density bias is particularly problematic in event-driven spiking representations, where sparse spike patterns often carry critical information. To address this limitation, we rethink spiking attention from a set-theoretic perspective. We propose DiceFormer, a novel Spiking Transformer architecture driven by Spike Dice Attention (SDA). Unlike traditional approaches, SDA replaces density-sensitive similarity measures with a set similarity function derived from the Dice coefficient. By explicitly normalizing for firing density, SDA focuses on spike co-occurrence rather than high firing rates. We primarily evaluate DiceFormer on the challenging audio domain, where spike sparsity varies substantially across inputs. On AudioSet-20k, DiceFormer achieves a state-of-the-art (SOTA) mAP of 0.161 with 54.3M parameters, outperforming prior SNN-based approaches and substantially narrowing the performance gap with ANN-based models. We also introduce Lin-SDA, a linearized variant of SDA that improves computational efficiency while achieving comparable performance. Beyond audio, we evaluate the effectiveness of SDA on CIFAR-100 to verify its applicability to the vision domain.
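As an informal illustration of the set-similarity idea (the exact SDA formulation is given later in the paper; the notation below is ours), a Dice-style attention score between binary spike query and key vectors $q_i, k_j \in \{0,1\}^d$ can be sketched as

% Hedged sketch, not the paper's definitive formulation: a Dice-coefficient
% similarity between binary spike vectors, with an illustrative smoothing
% term \epsilon to avoid division by zero when both vectors are silent.
\[
  \mathrm{sim}(q_i, k_j)
  \;=\;
  \frac{2\, q_i^{\top} k_j}
       {\lVert q_i \rVert_1 + \lVert k_j \rVert_1 + \epsilon}.
\]

Here the numerator counts co-occurring spikes (the set intersection), while the denominator is the total spike count of both vectors, so a key that merely fires often cannot dominate the score; only overlap relative to total firing density matters.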