Spatially-Regularized Entropy for Discriminative Token Merging in Fine-Grained Re-Identification
Abstract
While Vision Transformers (ViTs) offer strong global modeling, their computational cost, which grows quadratically with the number of tokens, limits their utility in latency-sensitive applications such as person re-identification (ReID). Existing compression strategies, such as token pruning or generic merging, typically rely on coarse-grained criteria tailored to image classification. In fine-grained retrieval, these approaches often discard or smooth out subtle but discriminative local details. To address this, we propose SRE-Merge, a training-free framework for discriminative token compression. SRE-Merge injects spatial priors into the merging process through three mechanisms: (i) Spatial-Entropy Saliency Assessment (SES-Assess), which quantifies token importance as Spatial-Entropic Mass (SE-Mass) by coupling spatial structure with local attention entropy; (ii) Hybrid Context-Affinity Matching (HCA-Match), which guides pair selection by combining feature similarity with mass-derived context; and (iii) Energy-Preserving Weighted Fusion (EPW-Fuse), which weights merged tokens by SE-Mass to counteract the reduction in feature variance caused by averaging. Extensive experiments on standard benchmarks show that SRE-Merge reduces the GFLOPs of the base ViT model by about 24\% while maintaining state-of-the-art accuracy, yielding a more favorable accuracy-efficiency trade-off.
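To make two of the ingredients named above concrete, the following is a minimal illustrative sketch, not the SRE-Merge implementation. It shows the Shannon entropy of a single attention distribution and one possible mass-weighted fusion of two tokens that rescales the merged vector to avoid the norm shrinkage of plain averaging. The function names `attention_entropy` and `fuse_tokens`, the rescaling rule, and the use of precomputed scalar masses are all assumptions made for illustration; the abstract does not specify how SE-Mass is computed or how EPW-Fuse preserves energy.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one non-negative attention row.

    The row is normalized to sum to 1 before the entropy is computed.
    """
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)


def l2_norm(vec):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(v * v for v in vec))


def fuse_tokens(x_a, x_b, m_a, m_b):
    """Hypothetical mass-weighted fusion of two token vectors.

    Assumption: m_a and m_b are positive scalar importance masses that
    were computed elsewhere. The tokens are averaged with weights
    proportional to their masses, then rescaled so that the norm of the
    result equals the mass-weighted mean of the two input norms.
    """
    w_a = m_a / (m_a + m_b)
    w_b = 1.0 - w_a
    merged = [w_a * a + w_b * b for a, b in zip(x_a, x_b)]
    norm = l2_norm(merged)
    if norm == 0.0:
        # Degenerate case: the weighted average is the zero vector.
        return merged
    target = w_a * l2_norm(x_a) + w_b * l2_norm(x_b)
    return [v * (target / norm) for v in merged]


if __name__ == "__main__":
    # A uniform attention row has maximal entropy; a one-hot row has none.
    print(attention_entropy([0.25, 0.25, 0.25, 0.25]))
    print(attention_entropy([1.0, 0.0, 0.0, 0.0]))
    # Plain averaging of two orthogonal unit vectors would shrink the norm
    # to about 0.707; the rescaled fusion restores it to 1.
    print(l2_norm(fuse_tokens([1.0, 0.0], [0.0, 1.0], 1.0, 1.0)))
```

In this sketch the rescaling step is the only thing that distinguishes the fusion from a weighted mean; other choices, such as matching the summed squared norms of the inputs, would be equally consistent with the description above.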