Poster in Workshop: The Second Workshop on Spurious Correlations, Invariance and Stability
Reviving Shift Equivariance in Vision Transformers
Peijian Ding · Davit Soselia · Thomas Armstrong · Jiahao Su · Furong Huang
Shift equivariance, integral to object recognition, is often disrupted in Vision Transformers (ViTs) by components such as patch embedding, subsampled attention, and positional encoding. Attempts to combine convolutional neural networks with ViTs have not fully resolved this issue. We propose an input-adaptive polyphase anchoring algorithm that integrates seamlessly into ViT models to ensure shift equivariance, and we employ depth-wise convolution to encode positional information. Our algorithms enable ViT and its variants, such as Twins, to achieve 100% consistency with respect to input shift and to remain robust to cropping, flipping, and affine transformations, maintaining consistent predictions where the original models lose 20 percentage points of accuracy on average under shifts of just a few pixels, with Twins' accuracy dropping from 80.57% to 62.40%.
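The abstract does not spell out the algorithm, but the core idea of polyphase anchoring is to align the subsampling grid with the input's strongest polyphase component before patch embedding, so that shifting the input only permutes which component is chosen rather than changing the tokens. Below is a minimal PyTorch sketch of that idea, together with a depth-wise-convolution positional encoding. The names (`polyphase_anchor`, `DepthwiseConvPE`), the squared-l2 energy criterion, and the circular padding are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn


def polyphase_anchor(x: torch.Tensor, stride: int) -> torch.Tensor:
    """Roll the input so its highest-energy polyphase component lands on
    the subsampling grid.  x: (B, C, H, W), H and W divisible by stride."""
    B, C, H, W = x.shape
    # View each stride x stride offset as a separate polyphase component:
    # (B, C, H//s, s, W//s, s), then score each offset by squared-l2 energy.
    phases = x.reshape(B, C, H // stride, stride, W // stride, stride)
    energy = phases.pow(2).sum(dim=(1, 2, 4))        # (B, stride, stride)
    idx = energy.flatten(1).argmax(dim=1)            # best offset per sample
    di, dj = idx // stride, idx % stride
    # Circularly roll each sample so its anchor offset moves to (0, 0);
    # a shifted input picks a correspondingly shifted anchor, so the
    # anchored outputs agree up to a roll by a multiple of the stride.
    return torch.stack([
        torch.roll(x[b], shifts=(-int(di[b]), -int(dj[b])), dims=(1, 2))
        for b in range(B)
    ])


class DepthwiseConvPE(nn.Module):
    """Positional information from a residual depth-wise 3x3 convolution.
    Circular padding keeps the operation exactly shift-equivariant; the
    paper's padding choice may differ."""

    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1,
                                groups=dim, padding_mode="circular")

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, dim, H, W)
        return x + self.dwconv(x)


if __name__ == "__main__":
    x = torch.randn(2, 3, 224, 224)
    shifted = torch.roll(x, shifts=(3, 5), dims=(2, 3))
    a = polyphase_anchor(x, stride=16)
    b = polyphase_anchor(shifted, stride=16)
    # a and b match up to a circular roll by a multiple of 16, so a
    # stride-16 patch embedding sees the same polyphase component.
```

In this sketch the anchoring step would run once per forward pass, before patch embedding or any subsampled attention, which is consistent with the abstract's claim that the method drops into existing ViT variants without architectural surgery.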