Boost the Identity-Preserving Embedding for Consistent Visual Generation
Abstract
Text-to-image models have advanced high-fidelity content generation, but their inability to maintain subject consistency hinders practical applications. Existing training-based methods rely on heavy computation and large datasets, while training-free approaches demand excessive memory or complex auxiliary modules. In this paper, we first reveal a key property overlooked in prior works: identity-relevant signals, which we term Identity-Preserving Embeddings (IPemb), are implicitly encoded in the textual embeddings of frame prompts. To achieve consistent T2I generation with IPemb, we propose Boost Identity-Preserving Embedding (BIPE), a training-free, plug-and-play framework that explicitly extracts and enhances the IPemb. Its core innovations are two complementary techniques. First, Adaptive Singular-Value Rescaling (adaSVR) applies singular-value decomposition to the joint embedding matrix of all frame prompts, amplifying identity-centric components while suppressing frame-specific noise. Second, Union Key (UniK) further reinforces consistency by aligning the T2I backbone's image-text attention across the entire generation sequence. Experiments on the ConsiStory+ benchmark demonstrate that BIPE outperforms existing methods in both qualitative and quantitative evaluations. To close the gap in evaluating a broader range of scenarios, we further introduce the DiverStory benchmark, built from diversified prompt templates, which confirms the scalability of our approach.
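To make the adaSVR idea concrete, the following is a minimal, illustrative sketch under stated assumptions: the exact rescaling rule, the number of boosted components, and all function and variable names below are placeholders, since the abstract only specifies that SVD is applied to the joint embedding matrix of all frame prompts, identity-centric components are amplified, and frame-specific noise is suppressed.

```python
# Illustrative sketch of the adaSVR step (not the authors' implementation).
# Assumption: the joint matrix stacks the textual embeddings of all frame
# prompts; `boost`, `damp`, and `top_k` are hypothetical knobs standing in
# for the paper's adaptive rescaling of singular values.
import torch

def adaptive_svr(frame_embeddings: torch.Tensor,
                 boost: float = 1.5, damp: float = 0.5, top_k: int = 4) -> torch.Tensor:
    """frame_embeddings: (num_frames * num_tokens, dim) joint embedding matrix."""
    U, S, Vh = torch.linalg.svd(frame_embeddings, full_matrices=False)
    scale = torch.full_like(S, damp)   # suppress frame-specific components
    scale[:top_k] = boost              # amplify shared, identity-centric components
    return (U * (S * scale)) @ Vh      # reconstruct the boosted embedding matrix

# Example: 5 frame prompts, 77 tokens each, 768-dim text embeddings
emb = torch.randn(5 * 77, 768)
boosted = adaptive_svr(emb)
print(boosted.shape)  # torch.Size([385, 768])
```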