CacheEdit: Efficient Multi-round Image Editing via Adaptive Token-wise Reuse
Jinxin Yu ⋅ Xueqing Chen ⋅ Yudong Pan ⋅ Lian Liu ⋅ Shengwen Liang ⋅ Huawei Li ⋅ Xiaowei Li ⋅ Ying Wang
Abstract
Instruction-based image editing (IIE) is a vital tool for iterative content creation, enabling multi-round interactions that refine visual details while preserving cross-round consistency. However, this workflow is constrained by the compute-bound nature of Diffusion Transformers (DiTs): because DiTs process tokens uniformly, they waste substantial computation on regions untouched by the instruction. We investigate the Round--Step--Layer hierarchy of DiT-based editing and identify a phenomenon we term Delayed Latent Emergence (DLE). Although pronounced latent changes emerge only in the late denoising stages, deep-layer activations within transformer blocks at the very first sampling step already diverge markedly in edited regions. Building on this insight, we propose CacheEdit, a training-free framework centered on an Adaptive Activation Cache (Acache) that exploits early-step sensitivity to detect invariant tokens and reuse their cached activations across subsequent sampling steps, thereby bypassing redundant computation. Experiments on FLUX.1 Kontext and Qwen-Image-Edit show that CacheEdit achieves up to $2.5\times$ end-to-end acceleration. Moreover, by isolating and reusing static features, CacheEdit mitigates stochastic drift and improves instruction-following and structural consistency over full-recomputation baselines.
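To make the caching mechanism concrete, the sketch below illustrates one plausible reading of the abstract: at the first sampling step, per-token activation divergence between the edited and reference streams marks "invariant" tokens, and later steps recompute only the remaining tokens while splicing in cached activations. This is not the authors' implementation; the class and parameter names (`ACache`, `divergence_threshold`, `block_fn`) and the gather/scatter strategy are assumptions, and the sketch only applies cleanly to per-token sub-modules (e.g., the feed-forward MLP inside a DiT block), since attention layers mix all tokens.

```python
# Minimal, illustrative sketch of an adaptive activation cache (assumptions:
# ACache, divergence_threshold, block_fn are hypothetical names; block_fn is
# a per-token module such as the MLP inside a DiT block).
import torch


class ACache:
    def __init__(self, divergence_threshold: float = 0.05):
        self.tau = divergence_threshold
        self.cached = None          # activations saved at the first sampling step
        self.invariant_mask = None  # True for tokens whose activations barely diverge

    def first_step(self, block_fn, tokens, reference_tokens):
        """Run the block on both streams and mark tokens with small divergence."""
        out = block_fn(tokens)               # [B, N, D], edited stream
        ref = block_fn(reference_tokens)     # [B, N, D], unedited reference stream
        div = (out - ref).norm(dim=-1) / (ref.norm(dim=-1) + 1e-6)
        self.invariant_mask = div < self.tau  # [B, N]
        self.cached = out.detach()
        return out

    def later_step(self, block_fn, tokens):
        """Recompute only non-invariant tokens; reuse cached activations for the rest."""
        out = self.cached.clone()
        active = ~self.invariant_mask
        for b in range(tokens.shape[0]):
            idx = active[b].nonzero(as_tuple=True)[0]
            if idx.numel() > 0:
                out[b, idx] = block_fn(tokens[b:b + 1, idx])[0]
        return out


# Usage sketch with a per-token MLP (safe to run on a token subset).
mlp = torch.nn.Sequential(torch.nn.Linear(64, 256), torch.nn.GELU(), torch.nn.Linear(256, 64))
cache = ACache(divergence_threshold=0.05)
x_edit = torch.randn(1, 128, 64)   # tokens of the image being edited
x_ref = torch.randn(1, 128, 64)    # tokens of the unedited reference
_ = cache.first_step(mlp, x_edit, x_ref)
y = cache.later_step(mlp, torch.randn(1, 128, 64))
```

The `divergence_threshold` knob is hypothetical; in practice the paper's criterion for detecting invariant tokens at the first step may differ, but the gather/recompute/scatter pattern captures how reusing cached activations skips computation on regions untouched by the instruction.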