Certified Evaluation for LLMs in Optimization Modeling: From Graph Isomorphism to Formulation Isomorphism
Abstract
Large language model (LLM) benchmarks typically evaluate correctness through task-specific rules, but for structured outputs, correctness is often an equivalence problem. We study this issue in automated optimization modeling, where LLMs translate natural-language descriptions and numerical data into solver-ready formulations. Existing solver-based evaluation only approximates modeling correctness: it offers no formulation-level guarantee, is uninformative for infeasible cases, and can be expensive on hard mixed-integer linear programming. To address these limitations, We propose ORGEval, a graph-theoretic evaluation framework that represents optimization formulations as bipartite graphs and reduces equivalence checking to graph isomorphism. To make this practical and reliable, we introduce symmetric decomposability (SD), a checkable condition under which Weisfeiler-Lehman color equivalence is exactly equivalent to formulation isomorphism. ORGEval combines bipartite WL testing with efficient SD verification to provide provably correct equivalence evaluation for SD-certified instances. We further constructed Bench4Opt, a benchmark that separates models from data and contains SD-certified ground-truth instances. Experiments show that ORGEval reliably detects formulation equivalence while being substantially faster than solver-based evaluation, especially on challenging instances. Benchmarking state-of-the-art LLMs reveals that optimization modeling remains difficult, with the best model achieving only 54.82\% accuracy.