Seeing the Unseen: Physics-as-Representation for Generalizable Gaze Perception
Abstract
We introduce physics-as-representation, a learning paradigm that encodes physical structure and geometric laws into visual representations, enabling models to see the unseen—the underlying 3D geometry and motion dynamics not apparent in raw pixels. We instantiate this paradigm in gaze perception by proposing SG-Gaze, a framework that learns a Structurally and Geometrically Consistent Representation (SGR) through dual-branch adversarial learning. An analytical branch embeds appearance features onto a spherical manifold aligned with gaze geodesics, while a model-guided branch reconstructs the 3D eyeball with weak 2D edge supervision. We further introduce View-Consistent Regularization, which augments SGR learning with synthetic view perturbations and enforces rotation-equivariant consistency across gaze vectors and structural projections, eliminating the need for multi-view calibration or explicit 3D labels. Extensive experiments across 12 challenging cross-domain transfers demonstrate that SG-Gaze achieves state-of-the-art accuracy and strong generalization. Our work highlights that enforcing structural and geometric consistency with equivariant regularization provides effective inductive biases for interpretable and generalizable representation learning—a step toward machines that perceive the world not only from pixels, but from physics.
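The rotation-equivariant consistency mentioned above can be illustrated with a minimal sketch: a gaze predictor f is equivariant when predicting on a rotated input equals rotating the prediction, i.e. f(R x) = R f(x). The predictor, rotation, and loss below are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def rotation_matrix_z(theta):
    """3x3 rotation about the z-axis by angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def equivariance_consistency_loss(predict, x, R):
    """Squared L2 gap between f(R @ x) and R @ f(x).

    For a perfectly rotation-equivariant predictor this is zero; in
    training it would be minimized as a regularization term.
    """
    gap = predict(R @ x) - R @ predict(x)
    return float(np.sum(gap ** 2))

# Toy predictor (hypothetical): normalize to a unit gaze vector.
# Normalization commutes with rotation, so it is exactly equivariant.
predict = lambda v: v / np.linalg.norm(v)

x = np.array([1.0, 0.0, 0.0])      # example gaze direction
R = rotation_matrix_z(np.pi / 6)   # synthetic 30-degree view perturbation
loss = equivariance_consistency_loss(predict, x, R)
print(loss)  # 0.0 (up to floating-point error) for this equivariant toy predictor
```

In SG-Gaze this kind of consistency term is what allows synthetic view perturbations to substitute for multi-view calibration or explicit 3D labels.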