StructMamPose: From Sequential Perception to Structural Reasoning for 3D Human Pose Estimation
Abstract
Accurately modeling complex temporal and topological dependencies, together with depth information, is critical for monocular 3D human pose estimation. Existing Mamba-based approaches struggle to meet these demands, suffering from internal state-update confusion induced by forced sequence flattening and from depth-modeling ambiguity inherent to single-view observation. To address these issues, we propose StructMamPose, a framework equipped with an Identity Anchoring Mechanism (IAM) and a View Transformation Hub (VTH). The IAM injects spatiotemporal identities into the parameter-generation network to anchor the selectivity of the state-update matrices, suppressing spurious connections so that features propagate only along valid topological dependencies. The VTH applies internal coordinate rotations that transform implicit depth inference into observable planar features, endowing the model with explicit spatial understanding and multi-view constraints. Experimental results demonstrate that our framework achieves state-of-the-art performance on most benchmark datasets.
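The core geometric intuition behind the view transformation can be illustrated with a minimal sketch (the function `rotate_view` and the specific rotation below are hypothetical illustrations, not the paper's implementation): rotating a 3D pose about the vertical axis maps each joint's depth coordinate into the horizontal axis of a synthesized side view, so depth becomes an observable in-plane quantity.

```python
import numpy as np

def rotate_view(joints: np.ndarray, angle_rad: float) -> np.ndarray:
    """Rotate a (J, 3) joint array about the vertical (y) axis.

    Hypothetical sketch: after a 90-degree rotation, each joint's
    original depth (z) appears as the horizontal (x) coordinate of the
    new view, i.e. implicit depth becomes a planar observable.
    """
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    # Standard rotation matrix about the y-axis.
    R = np.array([[  c, 0.0,   s],
                  [0.0, 1.0, 0.0],
                  [ -s, 0.0,   c]])
    return joints @ R.T

# Toy pose: a single joint at depth z = 2 in the source view.
pose = np.array([[0.5, 1.0, 2.0]])
side = rotate_view(pose, np.pi / 2)
# In the rotated (side) view, the depth 2.0 now shows up as x = 2.0.
```

A supervision signal on such synthesized views would constrain depth with ordinary 2D (planar) losses, which is one way to read the abstract's "multi-view constraints".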