Spike-HTR: Spiking Neural Transformer for Handwritten Text Recognition
Xiubo Liang ⋅ Jinxing Han ⋅ Yuke Li ⋅ Haoqi Zhu ⋅ Yu Zhao ⋅ Hongzhi Wang
Abstract
Offline handwritten text recognition (HTR) is blank-dominated: task-relevant evidence lies in sparse ink strokes, yet mainstream recognizers still apply dense spatial computation and full-length width-axis token mixing across the canvas. Spiking neural networks (SNNs) promise activity-proportional computation, but on static inputs the common frame-repetition encoding is redundant and stochastic coding is unstable under small timestep budgets. We propose Spike-HTR, a budgeted spiking Transformer that controls two coupled knobs: the spiking horizon $T$ and the effective token length $\ell_b$ after blank-guided reduction. InkCoder deterministically gates a shared static stem feature to form a stable coarse-to-fine temporal stream, and a stop-gradient CTC preview drives a CTC-aware keep-and-merge reducer to shorten the width-axis token stream before deep mixing. Trained from scratch without external pretraining, Spike-HTR reaches a rapid-response operating point and achieves $T{=}2$ val/test CERs of 3.5/5.4 on IAM, 2.3/2.5 on LAM, and 4.2/3.9 on READ2016. The implementation and scripts are included in the supplementary material.
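To make the keep-and-merge idea concrete, here is a minimal, hypothetical sketch of blank-guided token reduction. It is not the paper's learned reducer: the function name, the hard threshold, and the averaging merge rule are all illustrative assumptions; the only idea taken from the abstract is that per-position CTC blank probabilities decide which width-axis tokens are kept and which runs are collapsed.

```python
def keep_and_merge(tokens, blank_probs, keep_thresh=0.5):
    """Hypothetical blank-guided reducer (not the paper's learned module).

    tokens:      list of feature vectors along the width axis.
    blank_probs: per-position CTC blank probability from a preview head.
    Positions with blank probability below keep_thresh are kept verbatim;
    each maximal run of blank-dominated positions is merged into a single
    averaged token, shortening the stream before deep token mixing.
    """
    kept = []
    run = []  # current run of blank-dominated tokens awaiting a merge

    def flush():
        if run:
            n = len(run)
            # merge the run by element-wise averaging its feature vectors
            kept.append([sum(vals) / n for vals in zip(*run)])
            run.clear()

    for tok, p_blank in zip(tokens, blank_probs):
        if p_blank < keep_thresh:
            flush()        # close any pending blank run first
            kept.append(tok)
        else:
            run.append(tok)
    flush()                 # trailing blank run, if any
    return kept
```

Under this toy rule, a stream of four tokens whose middle two positions are blank-dominated reduces to three tokens, so the effective length $\ell_b$ shrinks with the amount of blank canvas.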