UniMedVL: Unifying Medical Multimodal Understanding and Generation through Observation-Knowledge-Analysis
Abstract
Medical diagnosis demands models that can process multimodal medical inputs, such as medical images and patient histories, and generate diverse outputs including textual reports and visual content, such as annotations or segmentation masks. Despite this need, existing medical AI models fragment this unified process: image understanding models interpret images without producing visual outputs, while image generation models produce visual outputs but cannot provide textual explanations. To bridge this gap, we propose a multi-level framework called Observation-Knowledge-Analysis (OKA) to unify these processes. Specifically, at the observation level, we construct UniMed-5M, a dataset of over 5.6M samples that reformats diverse unimodal data into multimodal pairs. At the knowledge level, we propose Progressive Curriculum Learning, through which models simultaneously acquire medical multimodal understanding and generation knowledge from UniMed-5M. At the analysis level, we introduce UniMedVL, the first unified medical multimodal model that performs image understanding and generation within a single architecture without manually reloading model checkpoints. UniMedVL achieves superior performance on 5 medical image understanding benchmarks, while matching specialized models in generation quality across 8 medical imaging modalities. Crucially, our unified architecture enables bidirectional knowledge sharing, improving performance on both image understanding and generation tasks. Code is available at https://anonymous.4open.science/r/Uni-MedVL-65F2/README.md.