Spherical SO(3)-Equivariant Local Attention
Yusuke Sekikawa ⋅ Jun Nagata ⋅ Itsumi Araki ⋅ Ruka Eto
Abstract
Spherical signals provide a natural representation for omnidirectional perception and often benefit from equivariance to 3D rotations. Recent spherical vision transformers implement local self-attention on spherical grids, but most retain only partial $\mathrm{SO}(3)$ equivariance and rely on $\textit{location-dependent}$ local positional embeddings (LPEs). Such LPEs can degrade robustness to camera tilt or object reorientation and introduce additional memory and computational overhead. We propose $\textit{Spherical $\mathrm{SO}(3)$-Equivariant Local Attention}$ (SoLA), an LPE-free local attention mechanism for spherical signals. SoLA achieves full $\mathrm{SO}(3)$ equivariance through a distance-preserving positional modulation that couples query/key features with each token’s unit direction. Specifically, the modulation lifts queries and keys via an outer product with a 4D direction-dependent vector. The induced similarity between modulated queries and keys depends on content affinity and great-circle distance while remaining invariant to global $\mathrm{SO}(3)$ rotations. The same formulation admits a softmax-free linear variant that computes local attention via key-value aggregation without per-query neighbor materialization. We integrate SoLA into a U-shaped spherical transformer for $360^\circ$ depth estimation and semantic segmentation, demonstrating substantially improved robustness to arbitrary 3D rotations compared to prior spherical transformers at similar computational cost.
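To make the mechanism concrete, the NumPy sketch below illustrates the outer-product lift and its rotation invariance under one illustrative assumption: the 4D direction-dependent vector is taken to be $d_i = [c, s\,n_i]$ for constants $c, s$, where $n_i$ is token $i$'s unit direction (the paper's exact construction may differ). With this choice, the similarity of lifted queries and keys factorizes into a content term times a great-circle-distance term, and the same factorization yields a softmax-free linear readout; the global aggregation shown here stands in for SoLA's local neighborhoods.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 8, 16

q = rng.standard_normal((N, D))  # query features
k = rng.standard_normal((N, D))  # key features
n = rng.standard_normal((N, 3))
n /= np.linalg.norm(n, axis=1, keepdims=True)  # unit token directions on S^2

def lift(x, n, c=1.0, s=1.0):
    """Couple features with token directions: outer product of each feature
    vector with the (assumed) 4D vector d = [c, s*n], then flatten."""
    d = np.concatenate([np.full((len(n), 1), c), s * n], axis=1)   # (N, 4)
    return (x[:, :, None] * d[:, None, :]).reshape(len(x), -1)    # (N, 4D)

# Induced similarity factorizes: <lift(q_i), lift(k_j)> =
#   <q_i, k_j> * (c^2 + s^2 * cos(great-circle angle between n_i and n_j)).
S = lift(q, n) @ lift(k, n).T
S_ref = (q @ k.T) * (1.0 + n @ n.T)  # with c = s = 1
assert np.allclose(S, S_ref)

# Invariance to a global SO(3) rotation: rotating every direction by the
# same R preserves all pairwise direction dot products, hence S.
R = np.linalg.qr(rng.standard_normal((3, 3)))[0]
R *= np.sign(np.linalg.det(R))  # ensure a proper rotation (det = +1)
S_rot = lift(q, n @ R.T) @ lift(k, n @ R.T).T
assert np.allclose(S, S_rot)

# Softmax-free linear variant (shown globally for illustration): aggregate
# lifted keys against values once, then let every query read out, with no
# per-query neighbor materialization.
v = rng.standard_normal((N, D))
KV = lift(k, n).T @ v       # (4D, D), a single aggregation pass
out = lift(q, n) @ KV       # (N, D)
assert np.allclose(out, S @ v)  # matches the unnormalized attention S v
```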