Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Vision-Language Models
Chengcheng Wang ⋅ Jianyuan Guo ⋅ Hongguang Li ⋅ Yuchuan Tian ⋅ Ying Nie ⋅ Chang Xu ⋅ Kai Han
Abstract
Rotary Position Embedding (RoPE) is widely adopted in large language models, but when applied to vision-language models (VLMs) it couples text and image position indices and can introduce spurious cross-modal relative-position bias. We propose Per-Token Distance (PTD) to quantify cross-modal positional disentanglement, and we prove that $\mathrm{PTD}=0$ is a sufficient condition for eliminating the geometric attention bias induced by RoPE. Guided by this criterion, we introduce Circle-RoPE, which remaps 2D image-token coordinates onto an annulus orthogonal to the text position axis, yielding a cone-like geometry in which each text token is equidistant from all image tokens while intra-image spatial structure is preserved. We further propose Alternating Geometry Encoding (AGE), which improves robustness by alternating RoPE variants across layers. Experiments on diverse VLM backbones and multimodal benchmarks show consistent gains in spatial grounding and visual reasoning.
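The cone-like geometry described above can be made concrete with a small sketch. The code below is illustrative only, not the paper's implementation: the function names (`circle_rope_coords`, `per_token_distance`), the raster-order angle assignment, the shared text-axis index for all image tokens, and the reading of PTD as the spread of text-to-image distances are all assumptions for the sake of the example.

```python
import numpy as np

def circle_rope_coords(grid_h: int, grid_w: int, radius: float = 1.0,
                       center_t: float = 0.0) -> np.ndarray:
    """Hypothetical sketch: remap a 2D image-token grid onto a circle lying
    in a plane orthogonal to the 1D text position axis.

    Each image token gets a distinct angle (preserving intra-image order in
    this simplified raster-order assignment), while every point on the
    circle is equidistant from any point on the text axis -- the cone-like
    geometry described in the abstract.
    """
    n = grid_h * grid_w
    # Flatten the 2D grid in raster order and assign each token an angle.
    angles = 2.0 * np.pi * np.arange(n) / n
    # 3D coordinates: (text-axis position, in-plane x, in-plane y).
    t = np.full(n, center_t)  # all image tokens share one text-axis index
    x = radius * np.cos(angles)
    y = radius * np.sin(angles)
    return np.stack([t, x, y], axis=-1)

def per_token_distance(text_pos: float, image_coords: np.ndarray) -> np.ndarray:
    """Distance from a text token at 1D position `text_pos` (a point on the
    text axis) to each remapped image token. With the circular mapping above
    these distances are all equal, so their spread -- one possible reading
    of a PTD-style measure -- is zero."""
    text_point = np.array([text_pos, 0.0, 0.0])
    return np.linalg.norm(image_coords - text_point, axis=-1)

coords = circle_rope_coords(4, 4, radius=2.0, center_t=10.0)
d = per_token_distance(text_pos=3.0, image_coords=coords)
print(np.allclose(d, d[0]))  # True: every image token is equidistant
```

The point of the equidistance property is that the relative offset between any text token and any image token becomes a constant, so RoPE's rotation cannot encode a spurious text-to-image ordering; the distinct angles on the circle are what retain the relative positions among the image tokens themselves.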