OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling
Abstract
We investigate the capabilities and scalability of Large Language Models (LLMs) in optimization modeling, a domain requiring structured reasoning and precise formulation. To this end, we introduce OPT-ENGINE, an extensible benchmark framework with quantifiable and controllable complexity. OPT-ENGINE spans ten canonical operations research problems, systematically scaling from Linear Programming to Mixed-Integer Programming, thus providing a structured environment for probing the limits of automated problem formulation and solving. Our results reveal a sharp performance degradation as task complexity scales, highlighting a critical robustness gap in pure-text reasoning. While LLMs struggle with end-to-end solution generation, we demonstrate that tool-integrated reasoning provides a significantly more resilient path forward, regardless of model size. Furthermore, we identify automated constraint formulation as the primary bottleneck. These insights clarify the limitations of current LLMs and provide a structured roadmap for developing next-generation LLMs for optimization modeling.