DiasR: Dual-Modal Identity-Anchored Sparse Routing for Efficient Multi-Subject Video Generation
Abstract
Personalized multi-subject video generation is a promising direction in controllable video generation, yet existing methods struggle to maintain cross-frame identity consistency and incur high computational overhead. To address these issues, we propose DiasR, an efficient framework that integrates Dual-Modal Identity-Anchored Alignment with a novel Sparse Routing Strategy. The Dual-Modal Identity-Anchored Alignment employs learnable identity queries to align visual and textual modalities against ground-truth subject masks, mitigating cross-frame identity drift. The Sparse Routing Strategy dynamically routes video tokens to their relevant subjects and groups them via bucket aggregation, reducing computational overhead and alleviating the identity entanglement induced by redundant tokens. We also construct MuSA-2M, a large-scale dataset of 2 million annotated samples with subject-level masks, addressing the lack of such annotations in existing multi-subject video datasets. Experiments on the OpenS2V-Eval benchmark demonstrate that our method achieves superior identity consistency, text fidelity, and video naturalness. Notably, inference time remains nearly constant as the number of reference subjects increases, and DiasR outperforms existing baselines in both efficiency and generation quality in multi-subject interaction scenarios.
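To make the routing idea concrete, the sketch below illustrates one plausible reading of token-to-subject sparse routing with bucket aggregation: each video token is assigned to its most similar identity query and tokens are then gathered into per-subject buckets. This is a minimal illustration only; the function name, the cosine-similarity scoring, and the top-k assignment rule are our assumptions, not the formulation used in DiasR.

```python
import torch
import torch.nn.functional as F

def sparse_route_tokens(video_tokens, identity_queries, top_k=1):
    """Assign each video token to its top-k subjects and group tokens into
    per-subject buckets (illustrative sketch, not the paper's exact method).

    video_tokens:     (N, D) flattened spatio-temporal token features
    identity_queries: (S, D) one learnable query per reference subject
    Returns a dict mapping subject index -> (M_s, D) bucket of tokens.
    """
    # Cosine similarity between every token and every subject query.
    sim = F.normalize(video_tokens, dim=-1) @ F.normalize(identity_queries, dim=-1).T  # (N, S)

    # Sparse assignment: each token is routed only to its top-k subjects.
    top_idx = sim.topk(top_k, dim=-1).indices  # (N, top_k)

    # Bucket aggregation: gather the tokens routed to each subject.
    buckets = {}
    for s in range(identity_queries.shape[0]):
        assigned = (top_idx == s).any(dim=-1)
        buckets[s] = video_tokens[assigned]
    return buckets

# Toy usage: 1024 tokens, 3 reference subjects, 64-dim features.
tokens = torch.randn(1024, 64)
queries = torch.randn(3, 64)
buckets = sparse_route_tokens(tokens, queries)
print({s: b.shape[0] for s, b in buckets.items()})
```

Because each subject only attends to its own bucket rather than the full token sequence, the per-subject cost stays roughly fixed as more reference subjects are added, which is consistent with the near-constant inference time reported above.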