MotionMAR: Multi-scale Auto-Regressive Human Motion Reconstruction from Sparse Observations
Abstract
Human motion exhibits an inherent temporal hierarchy, ranging from global low-frequency trajectories to local high-frequency dynamics. Inspired by this property and by the success of multi-scale autoregressive modeling in vision, we propose MotionMAR, a framework for human motion reconstruction from sparse observations. Unlike traditional methods, MotionMAR adopts a temporal coarse-to-fine design: it first estimates the global motion envelope and then progressively refines temporal details for higher precision. The framework comprises three key components. First, a Temporal Multi-scale VQ-VAE defines hierarchical levels by temporal resolution, disentangling global semantic information from fine-grained jitter. Second, a Motion Autoregressive Network (MAN) employs a next-scale prediction strategy: it generates coarse-scale indices to fix the global structure, then finer-scale indices to restore details. This process is guided by a Control Module that incorporates sparse tracking priors to keep the prediction aligned with the observed signals. Finally, a Motion Refinement Network acts as a temporal stabilizer in the decoded continuous pose space, mitigating quantization artifacts and smoothing local kinematics. Experiments demonstrate that MotionMAR achieves state-of-the-art accuracy, offering a principled, temporal-hierarchy-aware paradigm for robust motion reconstruction.
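To make the temporal coarse-to-fine idea concrete, the following is a minimal illustrative sketch of multi-scale residual tokenization on a single 1-D motion channel. It is not the MotionMAR implementation. Every name in it (`resample`, `encode_multiscale`, `decode_multiscale`), the scale schedule, and the step size are assumptions made for illustration. In particular, scalar rounding stands in for the learned VQ codebook, and linear interpolation stands in for learned encoders and decoders.

```python
from typing import List, Optional


def resample(seq: List[float], length: int) -> List[float]:
    """Resample a 1-D sequence to `length` samples (mean if length == 1, else linear)."""
    n = len(seq)
    if length == 1:
        return [sum(seq) / n]
    if n == 1:
        return [seq[0]] * length
    out = []
    for i in range(length):
        pos = i * (n - 1) / (length - 1)  # fractional source position
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(seq[lo] * (1 - frac) + seq[hi] * frac)
    return out


def encode_multiscale(signal: List[float], scales: List[int], step: float) -> List[List[int]]:
    """Coarse-to-fine residual encoding: each scale quantizes what coarser scales left unexplained."""
    total_len = len(signal)
    residual = list(signal)
    tokens = []
    for s in scales:
        coarse = resample(residual, s)
        idx = [round(x / step) for x in coarse]  # assumption: rounding replaces codebook lookup
        tokens.append(idx)
        upsampled = resample([i * step for i in idx], total_len)
        residual = [r - u for r, u in zip(residual, upsampled)]
    return tokens


def decode_multiscale(tokens: List[List[int]], total_len: int, step: float,
                      upto: Optional[int] = None) -> List[float]:
    """Sum the upsampled contributions of the first `upto` scales (all scales if None)."""
    recon = [0.0] * total_len
    for idx in tokens[:upto]:
        upsampled = resample([i * step for i in idx], total_len)
        recon = [a + b for a, b in zip(recon, upsampled)]
    return recon


if __name__ == "__main__":
    signal = [0.0, 0.3, 0.9, 1.4, 1.2, 0.7, 0.35, 0.1]  # hypothetical joint trajectory
    tokens = encode_multiscale(signal, scales=[1, 2, 4, 8], step=0.1)
    for k in range(1, len(tokens) + 1):
        recon = decode_multiscale(tokens, len(signal), 0.1, upto=k)
        err = max(abs(a - b) for a, b in zip(signal, recon))
        print(f"scales used: {k}, max abs error: {err:.3f}")
```

Decoding only the first scale yields a constant sequence, a crude analogue of the global envelope, and each further scale adds finer temporal detail. In a next-scale prediction setting, an autoregressive model would predict the index list of each scale conditioned on the coarser ones. That model, and any conditioning on sparse tracking signals, is omitted here.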