STORM: Segment, Track, and Object Re-Localization from a Single Image
Abstract
Accurate 6D pose estimation and tracking are core capabilities for physical AI systems, yet real-world deployment remains brittle and labor-intensive. Many pipelines rely on CAD models, manual masking, or per-object adaptation, and still fail under occlusion or fast motion without a principled way to recognize failure. We propose STORM, a unified framework for reference-conditioned 6D tracking with minimal manual input and improved robustness. STORM introduces two mechanisms: (i) Hierarchical Spatial Fusion Attention (HSFA), which performs latent manifold alignment between reference and query features, guided by vision-language semantic conditioning to resolve instance ambiguities; and (ii) an energy-based failure detector that flags drift and triggers automatic re-initialization, yielding a self-healing tracker. Experiments on LM-O and YCB-Video show that STORM improves annotation-free pose-tracking accuracy over strong baselines and recovers reliably from severe occlusions and rapid viewpoint changes with minimal overhead.
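The failure-detection mechanism can be illustrated with a minimal sketch. The energy function below (negative log-sum-exp of prediction logits, as commonly used in energy-based out-of-distribution detection) and the threshold value are illustrative assumptions, not STORM's actual formulation:

```python
import math

def energy_score(logits):
    """Free-energy style score: -logsumexp(logits).
    Lower energy indicates a confident prediction; high energy
    suggests drift. (Hypothetical stand-in for STORM's detector.)"""
    m = max(logits)
    return -(m + math.log(sum(math.exp(x - m) for x in logits)))

def track_step(logits, threshold=-2.0):
    """Keep tracking while energy stays below the (assumed) threshold;
    otherwise trigger automatic re-initialization."""
    return "track" if energy_score(logits) < threshold else "re-initialize"

# Peaked logits -> low energy -> keep tracking.
print(track_step([8.0, 0.1, 0.2]))   # track
# Flat logits -> high energy -> trigger re-initialization.
print(track_step([0.3, 0.2, 0.1]))   # re-initialize
```

The key design point is that the detector needs no ground-truth supervision at test time: a single scalar threshold on the energy decides when to hand control back to the re-localization stage.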