Poster Thu, Jul 9, 2026 • 1:00 AM – 2:45 AM PDT HALL A #2210

Addressing Semantic Blind Spots in Text-to-SQL via Component Pre-generation and AST Matching Rewards

Xingyu Ma ⋅ Xin Tian ⋅ Lingxiang Wu ⋅ Xuepeng Wang ⋅ Zhilin Zhang ⋅ Xueming Tang ⋅ Jinqiao Wang

Abstract

Recent advances in large language models have driven substantial progress on Text-to-SQL tasks. However, because these models generate output sequentially, token by token, they suffer from a semantic blind spot with respect to pending SQL components, that is, the parts of the SQL query that have not yet been generated. Specifically, the model cannot make effective use of the semantic information in these pending components while generating the final SQL query, which makes complex SQL statements considerably harder to generate. To address this issue, we propose a novel thought process based on SQL component pre-generation, and we design a maximum connected subtree matching reward over the SQL abstract syntax tree (AST) to improve the accuracy of local component generation. Extensive experiments demonstrate that, at comparable model parameter scales, our training approach offers significant advantages and effectively improves the generation of complex SQL queries. Our method attains an execution accuracy (EX) of 65.78% on the BIRD-dev dataset and achieves state-of-the-art performance on the Spider-syn datasets.
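The abstract does not spell out how the subtree matching reward is computed, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a toy AST of labeled nodes, aligns children by position, and normalizes the largest common connected subtree by the size of the gold tree. The names `Node`, `rooted_match`, and `subtree_reward` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy AST node (hypothetical; a real system would use a SQL parser's AST)."""
    label: str
    children: list["Node"] = field(default_factory=list)


def size(node: Node) -> int:
    """Number of nodes in the tree rooted at `node`."""
    return 1 + sum(size(c) for c in node.children)


def nodes(root: Node):
    """Yield every node of the tree in pre-order."""
    yield root
    for child in root.children:
        yield from nodes(child)


def rooted_match(p: Node, g: Node) -> int:
    """Size of the largest common connected subtree rooted at the pair (p, g).

    Assumption: children are aligned by position, which is a simplification.
    """
    if p.label != g.label:
        return 0
    return 1 + sum(rooted_match(pc, gc) for pc, gc in zip(p.children, g.children))


def subtree_reward(pred: Node, gold: Node) -> float:
    """Largest common connected subtree over all node pairs, as a fraction of the gold tree."""
    best = max(rooted_match(p, g) for p in nodes(pred) for g in nodes(gold))
    return best / size(gold)


def toy_query(threshold: str) -> Node:
    """Toy AST for: SELECT name FROM singer WHERE age > <threshold>."""
    return Node("select", [
        Node("columns", [Node("name")]),
        Node("from", [Node("singer")]),
        Node("where", [Node(">", [Node("age"), Node(threshold)])]),
    ])


gold = toy_query("30")
pred = toy_query("20")  # differs from gold only in the literal
reward = subtree_reward(pred, gold)  # 8 of the 9 gold nodes match
```

Under these assumptions, a prediction that gets most components right still earns partial credit, whereas execution accuracy alone would score it as a plain failure. That kind of dense, component-level signal is what the abstract suggests the reward is meant to provide.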

Lay Summary

Many people need to find information stored in databases, but writing SQL queries requires specialized knowledge. Text-to-SQL systems aim to solve this problem by allowing users to ask questions in natural language and automatically converting them into database queries. However, current language models often have difficulty with complex questions because they generate queries from left to right and may not fully consider important query parts that will appear later. In this paper, we propose a method that encourages the model to first prepare meaningful parts of the query before producing the final SQL query. We also introduce a training signal that helps the model learn whether these parts are structurally correct. Our experiments show that this approach improves the accuracy of query generation, particularly for complex database questions.