SpikingLM: Towards a Fully Spiking Language Model
Abstract
Spiking Neural Networks (SNNs) offer a promising avenue toward energy-efficient language modeling by replacing multiply-accumulate operations with sparse, event-driven computation. However, constructing fully spiking language models reveals two fundamental challenges: (1) gradient degradation from dead neurons caused by diminishing input magnitudes in deep networks, and (2) reduced token selectivity due to the absence of softmax's competitive winner-takes-all mechanism. These limitations create a substantial performance gap that has hindered the practical deployment of spiking language models. To address these challenges, we introduce SpikingLM, a framework that bridges the efficiency of SNNs with the capabilities of modern language models through two key innovations. First, we propose Distribution-aware Scaling, which rescales linear outputs to an optimal range that prevents vanishing gradients. The learned scaling parameters are fused into the preceding linear layers at inference, incurring zero additional overhead. Second, we introduce Spike2Max, a hardware-efficient attention mechanism that restores winner-takes-all dynamics through base-2 exponentiation and max-subtraction. By exploiting the integer-valued nature of spike coincidence counts, Spike2Max replaces floating-point exponentials with bit-shift operations, reducing attention energy consumption by over 95\% compared to softmax. Extensive experiments demonstrate that SpikingLM achieves a 57.9\% reduction in energy consumption while delivering state-of-the-art performance on the GLUE benchmark among spiking language models.
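To make the fusion step of Distribution-aware Scaling concrete, the sketch below folds a per-output-channel scale into the weights and bias of the preceding linear layer, so inference pays no extra multiply. This is a minimal sketch, not the paper's implementation: the per-channel shape of the scale and the helper name `fuse_distribution_aware_scaling` are assumptions for illustration.

```python
import torch
import torch.nn as nn

def fuse_distribution_aware_scaling(linear: nn.Linear, scale: torch.Tensor) -> nn.Linear:
    """Fold a rescaling y = scale * (x @ W.T + b) into the preceding
    linear layer, so inference needs no separate scaling multiply.

    Assumption: `scale` is a per-output-channel factor of shape
    (out_features,) applied after the projection. The fused layer
    computes x @ (scale[:, None] * W).T + scale * b == scale * linear(x).
    """
    fused = nn.Linear(linear.in_features, linear.out_features,
                      bias=linear.bias is not None)
    with torch.no_grad():
        # Scale each row of W (one row per output channel) and the bias.
        fused.weight.copy_(scale.unsqueeze(1) * linear.weight)
        if linear.bias is not None:
            fused.bias.copy_(scale * linear.bias)
    return fused
```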
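Similarly, the following sketch illustrates the arithmetic behind Spike2Max as the abstract describes it: because spike coincidence counts are integers, $2^{c - \max c}$ is an exact right bit-shift of a fixed-point one, so no floating-point exponential is needed. The fixed-point precision (`frac_bits`), the per-row normalization, and the function name `spike2max_weights` are illustrative assumptions.

```python
import numpy as np

def spike2max_weights(counts: np.ndarray, frac_bits: int = 15) -> np.ndarray:
    """Attention weights for one query row under a Spike2Max-style scheme:
    2**(c - max c) in place of softmax's exp(s - max s).

    Since coincidence counts are integers, c - max(c) is a non-positive
    integer, so each term is an exact right shift of a fixed-point 1.0.
    """
    counts = counts.astype(np.int64)
    shifts = counts.max() - counts                 # non-negative shift amounts
    shifts = np.minimum(shifts, frac_bits + 1)     # larger shifts underflow to 0 anyway
    terms = (np.int64(1) << frac_bits) >> shifts   # 2**(c - max c) in fixed point
    return terms / terms.sum()                     # normalize to attention weights
```

For example, counts [5, 3, 3, 0] yield weights proportional to [32, 8, 8, 1], reproducing the sharp winner-takes-all profile that softmax provides in conventional attention while using only integer shifts.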