From Diagrams to Code: Multilingual Programming with Visual Design
Abstract
In modern software development, particularly in the emerging "vibe coding" paradigm, project implementation increasingly begins with visual interaction between users and AI coding assistants, where system architectures are communicated through visual designs before any code is written. This visual-first approach requires AI systems that can interpret diagrams and generate code across multiple programming languages. However, the development of such systems is severely hindered by the lack of large-scale multimodal training data and evaluation benchmarks. To address these limitations, we present M2C-INSTRUCT, a comprehensive multilingual multimodal instruction-tuning dataset containing over 13.1M samples across 50+ programming languages, designed for visual understanding and diagram interpretation in code generation tasks. We validate the dataset by training M2-CODER, a multilingual multimodal software developer that integrates visual design inputs with textual instructions. We also introduce M2EVAL, a new multilingual benchmark for evaluating multimodal code generation. Experiments show that our 7B M2-CODER performs on par with much larger 70B+ models, confirming the quality and effectiveness of M2C-INSTRUCT. Together, M2C-INSTRUCT, M2-CODER, and M2EVAL provide essential infrastructure for visual-assisted programming in vibe-coding and visual-interactive development workflows.