SMM Transformer: Leveraging Spiking Neural Networks for Multimodal Tasks
Abstract
Spiking Neural Networks (SNNs) enable event-driven computation with sparse activations, but building multimodal Transformers on SNNs is hindered by unstable training in deep spiking stacks and by the mismatch between dense softmax attention and spike-based communication. We propose SMM Transformer, an SNN-based multimodal Transformer framework that combines (i) a Parallel LIF neuron with Multistage Learnable Parameters (PLMP) and a tailored P-STBP training algorithm to stabilize optimization, (ii) a spike-driven attention approximation (SMSA) with a lightweight self-compensation branch, and (iii) a spiking mixture-of-experts (SMoE) module for modality-aware fusion. Across visual and multimodal benchmarks, SMM Transformer achieves accuracy competitive with ANN baselines while reducing the estimated compute energy of the attention module by up to 97% under a standard MAC/AC cost model.