Compass-RoPE: Isotropic Rotary Position Embeddings for Vision Transformers
Abstract
Recent works introduce Rotary Position Embeddings (RoPE) into vision transformers (ViTs) to enhance their extrapolation capability, i.e., maintaining performance when inference is conducted on higher-resolution images. RoPE encodes positions via rotating phases whose rates of change are controlled by frequency components. Standard 2D RoPE does not generalize well to input resolution changes, as it only applies axial frequencies separately along each individual axis. To address this issue, Mix-RoPE combines x- and y-axis frequencies so that it can model positional relations along diagonal directions. However, in practice, we observe that the learned 2D frequencies become anisotropic in their direction distributions due to the axial spectral bias in image features, limiting the extrapolation ability of ViTs. Motivated by this observation, we propose Compass-RoPE. We replace the xy Cartesian coordinates with a polar parameterization that explicitly decouples frequency scale and angle. Initializing the angle vectors uniformly over [0, 2π) ensures isotropic direction coverage. We further introduce discrete Fourier transform (DFT) mixing for the angle vectors, allowing each element of the transformed angle vector to nest multiple angles and thus enrich angular expressiveness. Extensive experiments on multi-resolution classification and dense prediction tasks show that our Compass-RoPE achieves more stable extrapolation performance under large-scale resolution changes.
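To make the polar parameterization concrete, the sketch below illustrates the core idea in NumPy: frequency magnitudes follow a standard geometric RoPE schedule, while directions come from angle vectors initialized uniformly over [0, 2π) and optionally mixed by a DFT. All function names, the geometric base, and the random initialization are illustrative assumptions, not the paper's exact implementation (where the angles would be learnable parameters).

```python
import numpy as np

def compass_rope_frequencies(num_freqs, base=100.0, seed=0):
    """Hypothetical sketch of a polar (scale, angle) frequency parameterization.

    Standard 2D RoPE uses Cartesian (f_x, f_y) frequency pairs, which can
    cluster along the axes. Here scale and direction are decoupled:
    magnitudes follow a geometric schedule (as in 1D RoPE), and angles are
    drawn uniformly over [0, 2*pi) for isotropic direction coverage.
    """
    rng = np.random.default_rng(seed)
    # Frequency magnitudes: geometric decay, as in standard RoPE.
    mags = base ** (-np.arange(num_freqs) / num_freqs)
    # Angle vector: uniform over [0, 2*pi), so directions cover the circle.
    angles = rng.uniform(0.0, 2.0 * np.pi, size=num_freqs)
    # DFT mixing (illustrative): each mixed element combines all original
    # angles, so a single frequency component "nests" multiple directions.
    mixed = np.real(np.fft.fft(angles)) / num_freqs
    # Map polar (magnitude, angle) back to 2D frequency vectors.
    fx = mags * np.cos(mixed)
    fy = mags * np.sin(mixed)
    return fx, fy

def rope_phase(fx, fy, x, y):
    """Rotation phase for a token at 2D position (x, y)."""
    return fx * x + fy * y
```

Note that because the polar form fixes the magnitude independently of the angle, the mixed angles redistribute only the *directions* of the 2D frequencies: the norm of each (fx, fy) pair equals its scheduled magnitude regardless of how the angles are mixed.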