SceneDirector: Bridging Explicit Geometry and Generative Priors for Unified Driving Scene Editing
Abstract
Validating autonomous driving systems requires diverse scenarios, yet real-world data collection is biased and costly. Editing existing driving logs offers a scalable alternative, but simultaneously editing objects and the ego-trajectory, which we term unified editing, remains challenging. Current methods face an inherent dilemma between the generative flexibility required for object editing and the physical precision required for trajectory control. To address this, we introduce SceneDirector, a diffusion-based framework that bridges explicit geometry and generative priors. For explicit geometry, we leverage LiDAR-guided depth completion to construct dense scene geometry and integrate editable 3D assets, forming a Unified Geometric Scaffold that provides rigorous structural guidance for unified editing. For generative priors, we encode the source video into a Static Texture Bank that supplies rich appearance context. Our proposed Mask-Gated Reference Attention bridges these two modalities: guided by a geometric uncertainty metric, it dynamically regulates the interaction between the scaffold and the bank, preserving reliable geometry while adaptively injecting textures for semantic refinement. Extensive evaluations demonstrate that SceneDirector outperforms state-of-the-art methods in both controllability and visual quality.