Boost the Identity-Preserving Embedding for Consistent Visual Generation
Abstract
Text-to-image models have advanced high-fidelity content generation, but their inability to maintain subject consistency hinders practical applications. Existing training-based methods rely on heavy computation and large datasets, while training-free approaches demand excessive memory or complex auxiliary modules. In this paper, we first reveal a key property overlooked in prior works: identity-relevant signals, which we term Identity-Preserving Embeddings (IPemb), are implicitly encoded in the textual embeddings of frame prompts. To achieve consistent T2I generation with IPemb, we propose Boost Identity-Preserving Embedding (BIPE), a training-free, plug-and-play framework that explicitly extracts and enhances the IPemb. Its core innovations are two complementary techniques. First, Adaptive Singular-Value Rescaling (adaSVR) applies singular-value decomposition to the joint embedding matrix of all frame prompts, amplifying identity-centric components while suppressing frame-specific noise. Second, Union Key (UniK) further reinforces consistency by aligning the T2I backbone's image-text attention across the entire generation sequence. Experiments on the ConsiStory+ benchmark demonstrate that BIPE outperforms existing methods in both qualitative and quantitative evaluations. To close the gap in evaluating a broader range of scenarios, we further introduce the DiverStory benchmark, built from diversified prompt templates, which confirms the scalability of our approach.
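To make the adaSVR idea concrete, the following is a minimal, illustrative sketch under stated assumptions: the exact rescaling rule, the number of boosted components, and all function and variable names below are placeholders, since the abstract only specifies that SVD is applied to the joint embedding matrix of all frame prompts, identity-centric components are amplified, and frame-specific noise is suppressed.

```python
# Illustrative sketch of the adaSVR step (not the authors' implementation).
# Assumption: the joint matrix stacks the textual embeddings of all frame
# prompts; `boost`, `damp`, and `top_k` are hypothetical knobs standing in
# for the paper's adaptive rescaling of singular values.
import torch

def adaptive_svr(frame_embeddings: torch.Tensor,
                 boost: float = 1.5, damp: float = 0.5, top_k: int = 4) -> torch.Tensor:
    """frame_embeddings: (num_frames * num_tokens, dim) joint embedding matrix."""
    U, S, Vh = torch.linalg.svd(frame_embeddings, full_matrices=False)
    scale = torch.full_like(S, damp)   # suppress frame-specific components
    scale[:top_k] = boost              # amplify shared, identity-centric components
    return (U * (S * scale)) @ Vh      # reconstruct the boosted embedding matrix

# Example: 5 frame prompts, 77 tokens each, 768-dim text embeddings
emb = torch.randn(5 * 77, 768)
boosted = adaptive_svr(emb)
print(boosted.shape)  # torch.Size([385, 768])
```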