AdaS: Adaptive Gradient Descent for Spiking Transformers
Abstract
Transformer-based Spiking Neural Networks (SNNs) combine the performance of Transformers with the energy efficiency of SNNs through an event-driven self-attention mechanism. However, Spiking Transformers still lag behind their Artificial Neural Network (ANN) counterparts. Most existing studies address this gap through new architectural designs, yet none has considered optimization algorithms specific to Spiking Transformers. Here, we first analyze the gradient characteristics of Spiking Transformers and identify the excessive noise introduced by surrogate gradient learning as a major obstacle to stable training. We then provide a quantitative definition of noise in the gradient update direction and propose an adaptive gradient descent method for Spiking Transformers, named AdaS. Because moderate update direction noise can enhance generalization whereas excessive noise degrades training, AdaS adaptively adjusts the update direction noise to an optimal level, thereby improving the performance of Spiking Transformers. We conduct extensive experiments on various Spiking Transformer architectures and on datasets from both computer vision and natural language processing. The results demonstrate that AdaS consistently improves performance across different Spiking Transformers, validating its effectiveness and generalizability. This work presents the first systematic investigation of optimization algorithms specifically tailored for SNNs, offering a practical tool to narrow the accuracy gap with ANNs while preserving the energy advantages of spike-based computation.
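To make the notion of "update direction noise" concrete, the sketch below is a minimal, hypothetical interpretation rather than the AdaS rule defined in the paper: it treats the deviation of the current gradient from an exponential moving average (EMA) of past gradients as a proxy for noise in the update direction, and rescales that deviation toward a target level before applying an SGD-style step. The class name NoiseControlledSGD and all hyperparameters are illustrative assumptions.

# Illustrative sketch only: a hypothetical optimizer that estimates update
# direction noise as the deviation of the current gradient from an EMA of
# past gradients, then shrinks that deviation toward a target level.
# This is NOT the AdaS rule from the paper; names and defaults are assumptions.
import torch

class NoiseControlledSGD(torch.optim.Optimizer):
    def __init__(self, params, lr=0.1, ema_beta=0.9, target_noise=1.0):
        defaults = dict(lr=lr, ema_beta=ema_beta, target_noise=target_noise)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = closure() if closure is not None else None
        for group in self.param_groups:
            lr, beta, target = group["lr"], group["ema_beta"], group["target_noise"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if "grad_ema" not in state:
                    state["grad_ema"] = torch.zeros_like(p.grad)
                ema = state["grad_ema"]
                ema.mul_(beta).add_(p.grad, alpha=1 - beta)

                # Noise proxy: deviation of this step's gradient from the EMA,
                # measured relative to the EMA's magnitude.
                deviation = p.grad - ema
                noise_level = deviation.norm().item() / (ema.norm().item() + 1e-12)

                # Keep moderate noise, but shrink the noisy component whenever
                # the relative noise exceeds the target level.
                scale = min(1.0, target / (noise_level + 1e-12))
                update = ema + scale * deviation
                p.add_(update, alpha=-lr)
        return loss

In a training loop, an optimizer like this would simply replace the usual SGD or AdamW call, with target_noise controlling how much stochasticity in the update direction is retained.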