SAMT: Generating Structured Avatar Meshes and Textures from a Single Image
Abstract
Despite rapid progress in 3D generative models, producing production-grade 3D face assets from a single image remains challenging. To reconstruct facial micro-structures and fine-grained, multi-view-consistent textures, this work presents SAMT, a two-stage framework for monocular 3D avatar generation and texture synthesis. Specifically, a latent 3D diffusion model for facial mesh generation is first pretrained and then adapted to produce high-quality facial geometry via large-scale domain-specific fine-tuning on 35K curated 3D avatar models. Subsequently, a multi-view-aware texturing strategy is proposed to texture the generated facial mesh. Its core idea is to incorporate a multi-view facial prior, together with mesh geometry, to guide a 2D texturing diffusion model toward cross-view-consistent and mesh-aligned texture synthesis. Extensive experiments demonstrate that SAMT outperforms existing approaches, producing more structured and detailed facial geometry along with improved fine-grained appearance coherence.