AffIn-Space: Learning Affine-Invariant Representations for 3D Spatial Understanding with MLLMs
Abstract
While Multimodal Large Language Models (MLLMs) have achieved remarkable progress in general visual understanding, they suffer from a fundamental geometric fragility: standard visual representations often degrade rapidly under changes in viewpoint and viewing distance. Our analysis identifies that existing paradigms, whether relying on input-level fusion or latent reconstruction, remain entangled with the view-dependent pixel grid and fail to decouple intrinsic 3D structure from extrinsic camera pose. To address this, we introduce AffIn-Space, a framework that enforces strict affine invariance to enable robust spatial understanding. Unlike implicit learning approaches, AffIn-Space introduces a two-stage explicit decoupling mechanism. First, it performs explicit geometric resampling, using decomposed affine quantities derived from pose features to align 3D features to a canonical state before fusion. Second, within the MLLM, it imposes affine-invariant constraints via an orthogonal projection mechanism that mathematically strips pose-dependent noise from the hidden states while retaining recoverable geometric semantics through conditional reconstruction. Extensive experiments show that AffIn-Space achieves state-of-the-art performance on spatial reasoning tasks (VSI-Bench, ScanQA, SQA3D, and Scan2Cap) and spatial grounding tasks (ScanRefer and EmbodiedScan), demonstrating the effectiveness of affine-invariant representations for complex spatial understanding. Crucially, our approach exhibits superior stability under affine perturbations, validating the value of explicitly modeling geometric invariance. Code and detailed instructions will be publicly released.
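To make the orthogonal-projection idea referenced above concrete, the following is a minimal, hypothetical sketch (not the paper's actual implementation): it removes the components of MLLM hidden states that lie in a subspace spanned by estimated pose directions, leaving a residual that is orthogonal to those directions. The function name `project_out_pose`, the tensor shapes, and the use of a QR-based orthonormalization are illustrative assumptions.

```python
import torch

def project_out_pose(hidden, pose_dirs):
    """Generic orthogonal projection: remove the components of `hidden`
    that lie in the subspace spanned by estimated pose directions.

    hidden:    (..., d) hidden states from the MLLM.
    pose_dirs: (k, d) pose-related directions; not assumed orthonormal,
               so they are orthonormalized first.
    """
    # Orthonormalize the pose directions with a QR decomposition.
    q, _ = torch.linalg.qr(pose_dirs.T)        # q: (d, k), orthonormal columns
    # Component of `hidden` inside the pose subspace: (h @ q) @ q^T.
    pose_component = (hidden @ q) @ q.T
    # Pose-invariant residual: subtract the pose component.
    return hidden - pose_component

# Toy usage: 4 tokens with 16-dim hidden states, 2 pose directions.
h = torch.randn(4, 16)
p = torch.randn(2, 16)
h_inv = project_out_pose(h, p)
# The residual is (numerically) orthogonal to every pose direction.
print(torch.allclose(h_inv @ p.T, torch.zeros(4, 2), atol=1e-5))
```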