Poster Wed, Jul 8, 2026 • 1:00 AM – 2:45 AM PDT HALL A #4602

VCG-Bench: Towards A Unified Visual-Centric Benchmark for Structured Generation and Editing

Xiaoyan Su ⋅ Peijie Dong ⋅ Zhenheng Tang ⋅ Song Tang ⋅ Yuyao Zhai ⋅ Kaitao Lin ⋅ Liang Chen ⋅ Gai Yuhang ⋅ Yuyu Luo ⋅ Qiang Wang ⋅ Xiaowen Chu

Abstract

Despite rapid advances in Vision-Language Models (VLMs), a critical gap remains in their ability to handle the structured, controllable diagrammatic tasks essential for professional workflows. Existing methods rely predominantly on pixel-based synthesis, which operates in probabilistic pixel spaces and is inherently limited in editability and fidelity. We instead propose a new "Diagram-as-Code" paradigm with symbolic logic that leverages mxGraph Extensible Markup Language (XML) for precise diagram generation and editing. We present VCG-Bench, a unified benchmark for visual-centric mxGraph tasks comprising (1) a taxonomized dataset of 1,449 diverse diagrams spanning 6 domains and 15 sub-domains, (2) a paradigm definition that integrates Generation (Vision-to-Code) and Editability (Code-to-Code), and (3) a tailored evaluation protocol employing multi-dimensional metrics such as mxGraph Execution Success Rate and Style Consistency Score (SCS). Experimental results highlight the challenges that current State-of-the-Art (SOTA) VLMs face in structured fidelity and instruction compliance, which in turn reflect their vision and reasoning capabilities.
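To make the "Diagram-as-Code" idea concrete, the sketch below shows a small mxGraph XML document, a rough structural check standing in for an execution-success test, and a Code-to-Code edit that renames one label. It is only an illustration under stated assumptions: the names `SAMPLE`, `is_loadable`, `success_rate`, and `rename_label` are hypothetical, and the benchmark's actual definition of mxGraph Execution Success Rate is not given in the abstract and presumably involves loading the output in an mxGraph renderer rather than this simplified proxy.

```python
import xml.etree.ElementTree as ET

# A minimal mxGraph model: two rounded boxes joined by one edge.
# Cells "0" and "1" are the conventional root and default layer.
SAMPLE = """<mxGraphModel>
  <root>
    <mxCell id="0"/>
    <mxCell id="1" parent="0"/>
    <mxCell id="2" value="Start" style="rounded=1;" vertex="1" parent="1">
      <mxGeometry x="40" y="40" width="120" height="60" as="geometry"/>
    </mxCell>
    <mxCell id="3" value="End" style="rounded=1;" vertex="1" parent="1">
      <mxGeometry x="240" y="40" width="120" height="60" as="geometry"/>
    </mxCell>
    <mxCell id="4" edge="1" source="2" target="3" parent="1">
      <mxGeometry relative="1" as="geometry"/>
    </mxCell>
  </root>
</mxGraphModel>"""


def is_loadable(xml_text):
    """Simplified proxy check (an assumption, not the paper's metric):
    the text parses as XML, has an mxGraphModel root, cell ids are
    unique, and every edge endpoint refers to an existing cell."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    if root.tag != "mxGraphModel":
        return False
    cells = root.findall("./root/mxCell")
    ids = {cell.get("id") for cell in cells}
    if None in ids or len(ids) != len(cells):
        return False
    for cell in cells:
        if cell.get("edge") == "1":
            for end in ("source", "target"):
                ref = cell.get(end)
                if ref is not None and ref not in ids:
                    return False
    return True


def success_rate(outputs):
    """Fraction of model outputs that pass the proxy check."""
    if not outputs:
        return 0.0
    return sum(is_loadable(o) for o in outputs) / len(outputs)


def rename_label(xml_text, cell_id, new_label):
    """Code-to-Code edit: change one cell's label, leave the rest untouched."""
    root = ET.fromstring(xml_text)
    for cell in root.iter("mxCell"):
        if cell.get("id") == cell_id:
            cell.set("value", new_label)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    edited = rename_label(SAMPLE, "2", "Begin")
    print(is_loadable(edited))
    print(success_rate([SAMPLE, "<mxGraphModel><root>"]))
```

Because the edit touches a single attribute of a single cell, geometry and styling of every other element are preserved exactly, which is the kind of precision a pixel-based pipeline cannot guarantee.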

Lay Summary

When engineers, researchers, or designers create flowcharts and system diagrams, they need outputs they can actually edit — moving a box, renaming a label, rerouting an arrow. We noticed that today's AI tools treat diagrams like photographs: they paint them pixel by pixel, which makes the results blurry, error-prone, and nearly impossible to modify without starting over. We wondered: what if AI worked with diagrams the way a programmer works with code — reading and writing structured text that describes the diagram, rather than drawing it? This lets edits be made precisely and cleanly, the same way you'd fix a typo in a document rather than repainting the page. To find out how well current AI systems handle this, we built a collection of 1,449 professional diagrams — spanning flowcharts, software blueprints, business plans, and UI mockups — and asked today's best AI models to reconstruct them from images and modify them based on plain-language instructions. We found that even the most capable models struggle significantly, revealing a clear gap between what these systems can do today and what real workflows demand.