T-Edit: Triple-Branch Diffusion Anchoring for Consistent Editing
Abstract
While Multimodal Diffusion Transformers (MMDiTs) have achieved remarkable success in high-fidelity generation, maintaining semantic faithfulness and structural consistency during image editing remains a fundamental challenge. DiT-based editing is limited primarily by cumulative drift and by semantic leakage induced by new textual conditions. To address these challenges, we propose T-Edit, a training-free framework that formalizes consistent editing as a trajectory anchoring process. T-Edit explicitly decouples the inversion, reconstruction, and editing trajectories, leveraging the reconstruction branch as a structural reference to compensate, in real time, for deviations on the latent manifold. To further reveal the internal regulation mechanism of DiTs, we analyze the spatio-temporal heterogeneity of their layer-wise structural sensitivity and accordingly propose a Dynamic Vital Layer (DVL) localization mechanism based on information energy. Furthermore, to address the asymmetric frequency-domain distribution of textual perturbations, we introduce a frequency-aware anchoring (TA) strategy based on tensor Singular Value Decomposition (t-SVD) that anchors high-rank structural components. Experiments show that T-Edit achieves state-of-the-art performance in both semantic alignment and structural fidelity, and extends seamlessly to multi-step editing and video scenarios, offering a new perspective on understanding and controlling the internal stability of DiTs.
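The core idea of using the reconstruction branch as a structural reference can be sketched numerically. The toy below is a minimal illustration, not the paper's implementation: the `step` function, the anchoring strength `lam`, and the drift model are all hypothetical stand-ins. It only shows how the reconstruction branch's deviation from the inversion trajectory can estimate the drift shared with the editing branch, so that subtracting it keeps the edit on track.

```python
import numpy as np

# Hypothetical toy model: all names below are illustrative assumptions,
# not T-Edit's actual update rules.
rng = np.random.default_rng(0)

def step(z, shift, err):
    """Toy stand-in for one denoising step: a mild contraction plus a
    condition-dependent shift and a numerical error term."""
    return 0.9 * z + shift + err

T, SHAPE = 20, (4, 4)
z0 = rng.standard_normal(SHAPE)

# Inversion trajectory: the ideal, error-free reference path.
ref = [z0]
for _ in range(T):
    ref.append(step(ref[-1], shift=0.0, err=0.0))

# Error-free *edited* path (used only to measure drift afterwards).
ideal = z0
for _ in range(T):
    ideal = step(ideal, shift=0.02, err=0.0)

# Run reconstruction and editing branches with a shared per-step error,
# mimicking drift that accumulates identically in both branches.
z_rec, z_edit = z0.copy(), z0.copy()
lam = 1.0  # anchoring strength (hypothetical hyperparameter)
for t in range(T):
    eps = 0.05 * rng.standard_normal(SHAPE)      # shared numerical drift
    z_rec = step(z_rec, shift=0.0, err=eps)      # reconstruction branch
    z_edit = step(z_edit, shift=0.02, err=eps)   # editing branch
    # Anchoring: the reconstruction branch's deviation from the inversion
    # trajectory estimates the drift common to both branches; remove it.
    z_anchored = z_edit - lam * (z_rec - ref[t + 1])

print(np.linalg.norm(z_edit - ideal))      # plain editing branch drifts
print(np.linalg.norm(z_anchored - ideal))  # anchored branch stays close
```

Because the error term is shared between branches in this toy, the compensation removes it exactly; in a real DiT the two branches only partially share their drift, which is why T-Edit applies the compensation selectively (via DVL and the frequency-aware TA strategy) rather than uniformly.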