SpikeVLA: Vision-Language-Action Models with Spiking Neural Networks
Abstract
Vision-Language-Action (VLA) models have become a central paradigm for embodied intelligence. However, most existing approaches are built on large-scale Transformers, resulting in substantial inference latency and energy consumption that limit their practical deployment in low-power, real-time scenarios. We propose SpikeVLA, an end-to-end spiking VLA framework for embodied navigation with energy-efficient inference, consisting of three key components: (i) a spiking vision encoder, Spike-V, that replaces dense continuous computation with event-driven spiking representations to reduce the energy cost of visual representation learning; (ii) a multimodal spiking large language model, Spike-L, that reformulates cross-modal reasoning with spiking dynamics and token-level event-driven sparsity to further lower inference overhead; and (iii) a spiking action policy network, Spike-A, that uses Laplacian-kernel population coding and end-to-end reinforcement learning to produce stable, robust continuous control under low-energy constraints. Experiments on multimodal interaction and robotic control tasks show that SpikeVLA significantly reduces energy consumption and computational overhead while maintaining competitive performance, highlighting its potential for low-power, real-time embodied intelligence.
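To make the population-coding idea in Spike-A concrete, the following is a minimal sketch, not the authors' implementation: it encodes a 1-D continuous action into firing rates of a neuron population with Laplacian (double-exponential) tuning kernels and decodes it back as a rate-weighted mean. The population size `n`, kernel width `b`, and the [-1, 1] action range are illustrative assumptions; the abstract does not specify them.

```python
import numpy as np

def encode_population(action, n=16, b=0.2):
    """Map a scalar action to population rates via Laplacian tuning kernels.

    n and b are hypothetical choices for illustration only.
    """
    centers = np.linspace(-1.0, 1.0, n)             # assumed preferred values of the population
    rates = np.exp(-np.abs(action - centers) / b)   # Laplacian (double-exponential) kernel
    return rates, centers

def decode_population(rates, centers):
    """Recover a continuous action as the rate-weighted mean of preferred values."""
    return float(np.sum(rates * centers) / (np.sum(rates) + 1e-8))

rates, centers = encode_population(0.35)
print(decode_population(rates, centers))  # approximately 0.35, up to population resolution
```

In a spiking policy network, such rates would typically drive spike generation (e.g., Poisson or rate-to-spike conversion) rather than be used directly; that step is omitted here to keep the sketch self-contained.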