In-Context Generation with Regional Constraints for Instructional Video Editing
Abstract
The in-context generation paradigm has demonstrated strong capability in instructional image editing, yielding better synthesis quality. Nevertheless, extending such in-context generation to instructional video editing is non-trivial. Without specifying editing regions, the results can suffer from inaccurate localization of edits and token interference between different areas. To address these issues, we present ReCo, a new instructional video editing paradigm that explicitly delves into Regional Constraint modeling between editing and non-editing areas. Technically, ReCo concatenates the source and target videos width-wise for joint denoising. During training, ReCo formulates regional constraints through two regularization terms, i.e., latent regularization and attention regularization, applied to one-step backward denoised latents and attention maps, respectively. The former increases the latent discrepancy of the editing region between source and target videos while reducing that of non-editing areas, emphasizing modification of the editing area and alleviating unexpected content generation. The latter suppresses the attention from tokens in the editing region of the target video to their counterparts in the source video, thereby mitigating their interference during novel object generation in the target video. Furthermore, we propose a large-scale, high-quality video editing dataset, i.e., ReCo-Data, comprising 500K instruction-video pairs. Extensive experiments conducted on four major instruction-based video editing tasks verify the superiority of ReCo.
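The two regularization terms can be illustrated with a minimal NumPy sketch. This is only an interpretation of the abstract, not the authors' implementation: the function names, the binary editing-region mask, and the L1 discrepancy measure are all assumptions made for illustration.

```python
import numpy as np

def latent_reg(z_src, z_tgt, mask, eps=1e-8):
    """Illustrative latent regularization (assumed form).

    z_src, z_tgt: one-step backward denoised latents of source/target video.
    mask: 1 inside the editing region, 0 elsewhere (same shape as latents).
    Encourages large source/target discrepancy inside the editing region
    (subtracted term) and small discrepancy outside it (added term).
    """
    diff = np.abs(z_src - z_tgt)
    edit_term = (diff * mask).sum() / (mask.sum() + eps)
    keep_term = (diff * (1.0 - mask)).sum() / ((1.0 - mask).sum() + eps)
    return keep_term - edit_term  # lower is better

def attention_reg(attn, edit_queries, src_keys):
    """Illustrative attention regularization (assumed form).

    attn: post-softmax attention map (num_queries, num_keys) over the
    width-wise concatenated source/target token sequence.
    Penalizes attention from target editing-region queries to the
    corresponding source-video keys, suppressing their interference.
    """
    return attn[np.ix_(edit_queries, src_keys)].mean()
```

In training, both terms would be added (with weighting coefficients) to the standard denoising objective; the sketch above only shows the shape of the penalties, not their integration into a diffusion loop.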