LakeQA: A Benchmark for Complex Exploratory QA over a Million-Scale Data Lake
Abstract
Recent large language models (LLMs) have shown rapid progress on reading-based question answering (QA), where the evidence is explicitly provided or trivially retrievable. In contrast, real-world questions are rarely paired with accurate evidence documents: the useful evidence resides in massive data lakes, making search a prerequisite for answering. However, no existing benchmark requires both searching and reasoning over a data lake at this scale. To this end, we introduce LakeQA, a comprehensive benchmark for search-centric question answering over data lakes that jointly stresses \emph{searching} and \emph{reasoning} capabilities. LakeQA is built on a heterogeneous collection of approximately 9.5 TB of text resources from Wikipedia and open-source government data, spanning structured and unstructured data. To ensure task quality, each sample is annotated by at least one Ph.D.-level expert. Each task requires long-horizon multi-hop reasoning with implicit intermediate steps: agents must discover the correct document(s) and then compose evidence across sources to produce the answer. Extensive experiments on seven frontier LLMs demonstrate that LakeQA is challenging; for instance, GPT-5.2 achieves an exact-match score of only 14.73%. Overall, LakeQA provides a realistic testbed for developing LLM agents that can both \emph{find} and \emph{analyze} data in modern data lakes.