Addressing Semantic Blind Spots in Text-to-SQL via Component Pre-generation and AST Matching Rewards
Abstract
Recent advances in large language models have substantially improved performance on Text-to-SQL tasks. However, because these models generate output token by token, they suffer from a semantic blind spot with respect to pending SQL components, that is, the parts of the SQL query that have not yet been generated. Specifically, a language model cannot draw on the semantic information of these pending components while generating the final SQL query, which makes complex SQL statements considerably harder to generate. To address this issue, we propose a novel thought process based on SQL component pre-generation and design a maximum connected subtree matching reward, computed over the SQL abstract syntax tree (AST), to improve the accuracy of local component generation. Extensive experiments show that, at comparable model parameter scales, our training approach offers significant advantages and effectively improves the generation of complex SQL queries. Our method attains an execution accuracy (EX) of 65.78% on the BIRD-dev dataset and achieves state-of-the-art performance on the Spider-syn dataset.
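To make the idea of a subtree matching reward concrete, the following is a minimal illustrative sketch and not the reward defined in this paper. It assumes a toy AST encoding of nested `(label, children)` tuples, treats children as ordered, and normalizes by the size of the gold tree; the helper names `node`, `rooted_match`, and `subtree_reward` are hypothetical.

```python
from functools import lru_cache

# Assumed toy encoding: an AST node is (label, (child, child, ...)).
def node(label, *children):
    return (label, tuple(children))

@lru_cache(maxsize=None)
def rooted_match(a, b):
    """Size of the largest common connected subtree that contains both roots."""
    if a[0] != b[0]:
        return 0
    ca, cb = a[1], b[1]
    # Weighted longest-common-subsequence over the ordered child lists.
    dp = [[0] * (len(cb) + 1) for _ in range(len(ca) + 1)]
    for i in range(1, len(ca) + 1):
        for j in range(1, len(cb) + 1):
            dp[i][j] = max(
                dp[i - 1][j],
                dp[i][j - 1],
                dp[i - 1][j - 1] + rooted_match(ca[i - 1], cb[j - 1]),
            )
    return 1 + dp[-1][-1]

def subtrees(t):
    """Yield every node of the tree (each node roots one subtree)."""
    yield t
    for c in t[1]:
        yield from subtrees(c)

def size(t):
    return 1 + sum(size(c) for c in t[1])

def subtree_reward(pred, gold):
    """Largest shared connected subtree, as a fraction of the gold tree size."""
    best = max(rooted_match(p, g) for p in subtrees(pred) for g in subtrees(gold))
    return best / size(gold)

def make_query(op):
    # Hypothetical AST for: SELECT name FROM users WHERE age <op> 18
    return node(
        "query",
        node("select", node("col:name")),
        node("from", node("table:users")),
        node("where", node(op, node("col:age"), node("val:18"))),
    )

gold = make_query(">")
pred = make_query("=")
reward = subtree_reward(pred, gold)
```

In this toy case the prediction differs from the gold query only in the comparison operator, so the shared connected subtree covers the query root, the `select` and `from` branches, and the bare `where` node. A reward of this shape gives partial credit for locally correct components even when the full query is wrong.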