WorldComp2D: Spatio-semantic Representations of Object Identity and Location from Local Views
SeongMin Jin ⋅ Doo Seok Jeong
Abstract
Learning latent representations that capture both semantic and spatial information is central to efficient spatio-semantic reasoning. However, many existing approaches rely on implicit latent structures combined with dense feature maps or task-specific heads, limiting computational efficiency and flexibility. We propose WorldComp2D, a novel lightweight representation learning framework that explicitly structures latent space geometry according to object identity and spatial proximity using multiscale \textit{local} receptive fields centered on a given fixation point. The framework consists of (i) a proximity-dependent encoder that maps a given observation into a spatio-semantic latent space and (ii) a localizer that infers the coordinates of objects in the input from the resulting spatio-semantic representation. Using facial landmark localization as a proof of concept, we show that WorldComp2D preserves fine-grained spatio-semantic representations of facial landmarks while achieving competitive accuracy at substantially lower computational cost. Compared to state-of-the-art lightweight models, WorldComp2D reduces the number of parameters and FLOPs by up to $4.0\times$ and $2.2\times$, respectively, while maintaining real-time performance on CPU. These results demonstrate that explicitly structured latent spaces provide an efficient and general foundation for spatio-semantic reasoning. This framework is open-sourced at https://anonymous.4open.science/r/WorldComp2D-10F6/.
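To make the two-stage design described above concrete, the sketch below pairs a proximity-dependent encoder over multiscale local crops centered on a fixation point with a localizer that regresses landmark coordinates from the latent vector. This is a minimal illustrative sketch, not the released implementation: all module names, layer sizes, the crop schedule, and the landmark count are our own assumptions.

# Illustrative PyTorch sketch of the encoder + localizer pipeline.
# Assumptions (not from the paper): crop scales, channel widths,
# latent dimension, and a 68-landmark output head.
import torch
import torch.nn as nn
import torch.nn.functional as F


def multiscale_crops(image, fixation, scales=(32, 64, 128), out_size=32):
    """Extract square crops of increasing size around `fixation` (x, y),
    then resize each to a common resolution; coarser scales give wider
    context around the fixation point."""
    _, _, H, W = image.shape
    x, y = int(fixation[0]), int(fixation[1])
    crops = []
    for s in scales:
        half = s // 2
        # Clamp the crop window to the image bounds (simple-clamping assumption).
        x0, y0 = max(0, x - half), max(0, y - half)
        x1, y1 = min(W, x + half), min(H, y + half)
        patch = image[:, :, y0:y1, x0:x1]
        crops.append(F.interpolate(patch, size=out_size, mode="bilinear",
                                   align_corners=False))
    # Stack the scales along the channel axis: (B, 3 * len(scales), out, out).
    return torch.cat(crops, dim=1)


class ProximityEncoder(nn.Module):
    """Maps multiscale local views to a single spatio-semantic latent vector."""
    def __init__(self, in_ch=9, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )

    def forward(self, views):
        return self.net(views)


class Localizer(nn.Module):
    """Regresses (x, y) coordinates of K objects from the latent vector."""
    def __init__(self, latent_dim=128, num_objects=68):
        super().__init__()
        self.num_objects = num_objects
        self.head = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, num_objects * 2),
        )

    def forward(self, z):
        return self.head(z).view(-1, self.num_objects, 2)


# Usage: encode local views around a fixation, then localize landmarks.
image = torch.randn(1, 3, 256, 256)
views = multiscale_crops(image, fixation=(128, 128))   # (1, 9, 32, 32)
z = ProximityEncoder()(views)                          # spatio-semantic latent
coords = Localizer()(z)                                # (1, 68, 2) landmark coords

Keeping the localizer a small head over a compact latent vector, rather than a dense feature map, is what makes the parameter and FLOP savings plausible; the exact architectural choices above are placeholders.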