Poster Tue, Jul 7, 2026 • 10:30 PM – 12:15 AM PDT HALL A #4212

REViT: Roto-reflection Equivariant Convolutional Vision Transformer

Sheir A. Zaheer ⋅ Alexander Holston ⋅ Chan Youn Park

Abstract

We propose a discrete roto-reflection group equivariant vision transformer with convolutional attention. Roto-reflection equivariant networks preserve the rotational, flip, and positional symmetry in feature maps, making them useful for tasks where the orientation of the inputs is relevant to the model outputs. In image classification and object detection, most studies on roto-reflection equivariant models have focused on convolutional neural networks rather than vision transformers. We examine the challenges involved in achieving equivariance in vision transformers and propose a simpler way to implement a discretized roto-reflection group equivariant vision transformer. The experimental results demonstrate that our approach outperforms existing approaches for developing discrete roto-reflection group equivariant neural networks for image classification.
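To make the equivariance property concrete, the following is a minimal NumPy sketch and not REViT's attention mechanism. It assumes the discrete roto-reflection group is the eight-element group of 90-degree rotations and mirror flips, and it uses a classic group-correlation lift followed by pooling over the group. The function names (`d4_elements`, `correlate_circular`, `lifted_and_pooled`) and the periodic boundary handling are illustrative choices, not taken from the paper.

```python
import numpy as np

def d4_elements():
    """Return the 8 roto-reflections of a square array as callables (assumed group)."""
    ops = []
    for k in range(4):
        ops.append(lambda a, k=k: np.rot90(a, k))             # pure rotation
        ops.append(lambda a, k=k: np.rot90(np.fliplr(a), k))  # flip, then rotation
    return ops

def correlate_circular(x, kernel):
    """Cross-correlate x with an odd-sized square kernel using periodic boundaries."""
    r = kernel.shape[0] // 2
    out = np.zeros_like(x, dtype=float)
    for di in range(-r, r + 1):
        for dj in range(-r, r + 1):
            # np.roll by (-di, -dj) brings x[i + di, j + dj] to position (i, j).
            out += kernel[di + r, dj + r] * np.roll(x, shift=(-di, -dj), axis=(0, 1))
    return out

def lifted_and_pooled(x, kernel):
    """Correlate with every transformed copy of the kernel, then max over the group."""
    responses = [correlate_circular(x, g(kernel)) for g in d4_elements()]
    return np.max(responses, axis=0)

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
kernel = rng.standard_normal((3, 3))

# Equivariance check: transforming the input should transform the output the same way.
equivariant = all(
    np.allclose(lifted_and_pooled(h(image), kernel),
                h(lifted_and_pooled(image, kernel)))
    for h in d4_elements()
)
print("pooled group correlation is equivariant:", equivariant)
```

Transforming the input permutes the set of transformed kernels among themselves, so the maximum over the group moves with the image. A single correlation with an arbitrary kernel generally lacks this property.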

Lay Summary

REViT is a new type of vision transformer designed to better understand images regardless of how objects are rotated or flipped. Traditional image recognition systems often struggle when objects appear in different orientations, since rotations or flips can significantly change the model’s internal representation of the same object. As a result, they typically rely on large amounts of augmented training data to learn these variations. REViT addresses this problem by building rotational and reflection symmetry directly into the model architecture, allowing it to naturally recognize patterns even when images are transformed. The proposed approach combines ideas from convolutional neural networks and vision transformers to create a simpler and more efficient way of achieving rotation and flip awareness in transformer-based models. REViT's redefined attention reduces complex positional computations, which in turn reduces model size. Experiments on multiple image classification benchmarks show that REViT outperforms existing equivariant neural networks and standard vision transformers, while also scaling successfully to large datasets such as ImageNet. These results suggest that equivariant vision transformers could improve the robustness and reliability of computer vision systems used in areas such as medical imaging, robotics, scientific analysis, and autonomous systems, where objects may appear in many different orientations and their orientation is relevant for meaningful inference.