Beyond Text-to-SQL: Can LLMs Really Debug Enterprise ETL SQL?
Jing Ye ⋅ Yiwen Duan ⋅ Yonghong Yu ⋅ Victor Ma ⋅ Yang Gao ⋅ Xing Chen
Abstract
SQL is central to enterprise data engineering, yet generating fully correct SQL code in a single attempt remains difficult—even for experienced developers and advanced \ttsql LLMs—and often requires multiple debugging iterations. We introduce \textbf{\ourbench}, the first benchmark for enterprise-level SQL reasoning and debugging. Our benchmark is built upon two key innovations: (1) an \textbf{automated construction workflow} that employs reverse engineering to systematically inject realistic bugs into large-scale SQL code, enabling scalable and diverse benchmark generation; and (2) an \textbf{execution-free evaluation framework} tailored for enterprise settings, providing fast, accurate, and resource-efficient assessment. \ourbench comprises $469$ \ourbenchsyn queries featuring syntax errors with explicit error messages, and $516$ \ourbenchsem queries targeting semantic errors where the code fails to meet user intent. The queries are highly complex, averaging over $140$ lines, with deep and wide abstract syntax trees (average width $>11$, depth $>8.7$). Evaluation of nearly $30$ LLMs reveals a substantial performance gap: the best-performing model, Claude-4-Sonnet, achieves only $36.46\%$ accuracy on \ourbenchsyn and $32.17\%$ on \ourbenchsem, while most models score below $20\%$. We further explore four solution strategies, identify key challenges, and outline promising directions for enterprise SQL debugging with LLMs.