VCG-Bench: Towards A Unified Visual-Centric Benchmark for Structured Generation and Editing
Abstract
Despite rapid advances in Vision-Language Models (VLMs), a critical gap remains in their ability to handle the structured, controllable diagrammatic tasks that professional workflows require. Existing methods rely predominantly on pixel-based synthesis, which operates in a probabilistic pixel space and is inherently limited in editability and fidelity. We instead propose a "Diagram-as-Code" paradigm grounded in symbolic logic, which uses mxGraph Extensible Markup Language (XML) for precise diagram generation and editing. We present VCG-Bench, a unified benchmark for visual-centric mxGraph tasks comprising (1) a taxonomized dataset of 1,449 diverse diagrams spanning 6 domains and 15 sub-domains, (2) a paradigm definition that integrates Generation (Vision-to-Code) and Editability (Code-to-Code), and (3) a tailored evaluation protocol with multi-dimensional metrics such as mxGraph Execution Success Rate and Style Consistency Score (SCS). Experimental results highlight the challenges that current State-of-the-Art (SOTA) VLMs face in structured fidelity and instruction compliance, two aspects that reflect their vision and reasoning capabilities.
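To make the "Diagram-as-Code" idea concrete, the following is a minimal Python sketch of what a Code-to-Code edit on an mxGraph XML diagram might look like. It is an illustration under stated assumptions, not the benchmark's pipeline: the sample diagram, the helper names (`parse_style`, `format_style`, `is_well_formed`, `set_fill_color`), and the parse-based validity check are all hypothetical, and the check is only a crude stand-in for an execution test, not the definition of the mxGraph Execution Success Rate or SCS metrics.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-node diagram in mxGraph XML (illustrative, not from the dataset).
DIAGRAM_XML = """
<mxGraphModel>
  <root>
    <mxCell id="0"/>
    <mxCell id="1" parent="0"/>
    <mxCell id="2" value="Start" style="rounded=1;fillColor=#ffffff;" vertex="1" parent="1">
      <mxGeometry x="40" y="40" width="120" height="60" as="geometry"/>
    </mxCell>
    <mxCell id="3" value="End" style="rounded=1;fillColor=#ffffff;" vertex="1" parent="1">
      <mxGeometry x="240" y="40" width="120" height="60" as="geometry"/>
    </mxCell>
    <mxCell id="4" edge="1" source="2" target="3" parent="1">
      <mxGeometry relative="1" as="geometry"/>
    </mxCell>
  </root>
</mxGraphModel>
"""


def parse_style(style: str) -> dict:
    """Split an mxGraph style string ('k=v;k=v;') into a dict.

    Bare tokens without '=' are ignored in this sketch.
    """
    pairs = (item.split("=", 1) for item in style.split(";") if "=" in item)
    return {key: value for key, value in pairs}


def format_style(style: dict) -> str:
    """Serialize a style dict back into 'k=v;k=v;' form."""
    return "".join(f"{key}={value};" for key, value in style.items())


def is_well_formed(xml_text: str) -> bool:
    """Assumed proxy for validity: the XML parses and has the expected skeleton."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return root.tag == "mxGraphModel" and root.find("root") is not None


def set_fill_color(xml_text: str, cell_id: str, color: str) -> str:
    """Code-to-Code edit: change the fill color of one cell, leave the rest intact."""
    root = ET.fromstring(xml_text)
    for cell in root.iter("mxCell"):
        if cell.get("id") == cell_id:
            style = parse_style(cell.get("style", ""))
            style["fillColor"] = color
            cell.set("style", format_style(style))
    return ET.tostring(root, encoding="unicode")


# Example edit instruction: "make the Start node light blue".
edited = set_fill_color(DIAGRAM_XML, "2", "#dae8fc")
```

Because the edit targets a single attribute of a single cell, every other element is carried over unchanged, which is the kind of precise, local editability that pixel-based synthesis cannot guarantee.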