Do Text Edits Generalize to Visual Generation? Benchmarking Cross-Modal Knowledge Editing in UMMs
Abstract
Unified multimodal models (UMMs) have emerged as a promising paradigm for general-purpose multimodal intelligence. As they are deployed in real-world applications, the ability to update their internal knowledge effectively becomes critical. While knowledge editing methods have matured for text-only models, a fundamental question remains unexplored: do knowledge edits that successfully modify a UMM's textual outputs also transfer to its image generation? To answer this question, we introduce UniKE, the first benchmark for cross-modal knowledge editing in UMMs, comprising 3,005 instances spanning attribute edits and relation edits. We further propose an automated VQA-based evaluation protocol that assesses factual consistency between the edited knowledge and the generated images. Our evaluation reveals a striking modality gap: parameter-editing methods that achieve high text-side efficacy (up to 93\%) fail to produce the corresponding visual changes, with VQA accuracy below 6\% under direct generation. We propose Reasoning-augmented Parameter Editing, which explicitly activates the edited knowledge before generation and improves visual verification accuracy to 10-27\% on attribute edits. Through mechanistic analysis, we identify the root cause: the pathways affected by edits exhibit near-random overlap with the channels that condition visual attributes, indicating a fundamental pathway mismatch. These findings demonstrate that textual knowledge edits do not guarantee cross-modal transfer, motivating future work on modality-aware editing methods.
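To make the VQA-based evaluation concrete, the sketch below shows one plausible shape such a consistency check could take; it is not the benchmark's actual implementation. The `generate_image` and `ask_vqa` callables, the `EditRecord` fields, and the exact-match scoring are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class EditRecord:
    """One knowledge edit to verify visually.
    Field names are illustrative, not the benchmark's schema."""
    prompt: str            # generation prompt targeting the edited subject
    question: str          # VQA question probing the edited fact
    expected_answer: str   # answer implied by the edit (e.g. "green")

def vqa_edit_accuracy(
    edits: List[EditRecord],
    generate_image: Callable[[str], "Image"],   # edited UMM: prompt -> image
    ask_vqa: Callable[["Image", str], str],     # VQA model: (image, question) -> answer
) -> float:
    """Fraction of edits whose generated image reflects the new fact,
    as judged by an off-the-shelf VQA model."""
    hits = 0
    for edit in edits:
        image = generate_image(edit.prompt)
        answer = ask_vqa(image, edit.question)
        # Exact match after normalization; a real protocol might use
        # soft matching or an LLM judge instead.
        if answer.strip().lower() == edit.expected_answer.strip().lower():
            hits += 1
    return hits / len(edits) if edits else 0.0
```

Under this framing, "VQA accuracy below 6\%" means fewer than 6\% of generated images lead the VQA model to return the edit-consistent answer.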