Fast-SAM3D: 3Dfy Anything in Images but Faster
Weilun Feng ⋅ Mingqiang Wu ⋅ Zhiliang Chen ⋅ Chuanguang Yang ⋅ Haotong Qin ⋅ Yuqi Li ⋅ Xiaokun Liu ⋅ Guoxin Fan ⋅ Zhulin An ⋅ Libo Huang ⋅ Yulun Zhang ⋅ Michele Magno ⋅ Yongjun Xu
Abstract
SAM3D enables scalable, open-world 3D reconstruction from complex scenes, yet its deployment is hindered by prohibitive inference latency. In this work, we conduct the **first systematic investigation** into its inference dynamics, revealing that generic acceleration strategies are brittle in this context. We demonstrate that this brittleness stems from neglecting the pipeline's inherent multi-level **heterogeneity**: the kinematic distinctiveness between shape and layout, the intrinsic sparsity of texture refinement, and the spectral variance across geometries. To address this, we present **Fast-SAM3D**, a training-free framework that dynamically aligns computation with instantaneous generation complexity. Our approach integrates three heterogeneity-aware mechanisms: (1) *Modality-Aware Step Caching* to decouple structural evolution from sensitive layout updates; (2) *Joint Spatiotemporal Token Carving* to concentrate refinement on high-entropy regions; and (3) *Spectral-Aware Token Aggregation* to adapt decoding resolution. Extensive experiments demonstrate that Fast-SAM3D delivers up to **2.67$\times$** end-to-end speedup with negligible fidelity loss, establishing a new Pareto frontier for efficient single-view 3D generation.
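As a rough illustration of the first mechanism, the sketch below shows what modality-aware step caching could look like inside a denoising loop: the layout branch is recomputed at every step, while the shape branch reuses its cached update whenever the previous update was relatively small. The `denoise_step` function, the shape/layout latent split, and the `reuse_threshold` and `refresh_every` values are hypothetical placeholders for exposition; the abstract does not disclose SAM3D's actual internals or the authors' caching criterion.

```python
# Illustrative sketch only. The abstract does not describe SAM3D's internals, so the
# denoiser, the shape/layout latent split, and all thresholds below are hypothetical
# placeholders meant to convey the idea of modality-aware step caching.
import numpy as np


def denoise_step(latent: np.ndarray, step: int) -> np.ndarray:
    """Stand-in for one expensive denoising pass (updates shrink as steps progress)."""
    rng = np.random.default_rng(step)
    return latent + 0.01 / (step + 1) * rng.standard_normal(latent.shape)


def generate(shape_latent, layout_latent, num_steps=50,
             reuse_threshold=0.05, refresh_every=5):
    """Denoising loop that caches the shape branch but always refreshes the layout.

    The layout branch is treated as sensitive and recomputed every step, while the
    structural (shape) branch reuses its cached update whenever the previous update
    was relatively small, except at periodic refresh steps.
    """
    cached_delta = None
    recomputed = 0
    for step in range(num_steps):
        # Layout is always refreshed: small errors here shift object placement.
        layout_latent = denoise_step(layout_latent, step)

        reuse = (
            cached_delta is not None
            and step % refresh_every != 0
            and np.linalg.norm(cached_delta)
            / (np.linalg.norm(shape_latent) + 1e-8) < reuse_threshold
        )
        if reuse:
            # Structural evolution is slow here: apply the cached update for free.
            shape_latent = shape_latent + cached_delta
        else:
            new_shape = denoise_step(shape_latent, step)
            cached_delta = new_shape - shape_latent
            shape_latent = new_shape
            recomputed += 1
    print(f"shape branch recomputed on {recomputed}/{num_steps} steps")
    return shape_latent, layout_latent


if __name__ == "__main__":
    generate(np.zeros((1024, 8)), np.zeros((16, 8)))
```

In this toy setting the saved steps come from skipping the shape-branch denoiser call; the paper's actual criterion for when structural evolution is stable enough to reuse may differ.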