Relational In-Context Learning via Synthetic Pre-training with Structural Prior
Abstract
Relational Databases (RDBs) are the backbone of modern business, yet they have largely missed the Foundation Model revolution. Unlike text or images, high-quality RDB data is private and scarce, rendering the standard approach of ``pre-training on the internet'' infeasible. Consequently, existing solutions typically rely on limited real-world datasets and require costly fine-tuning to reach viable performance. To overcome this data scarcity, we introduce RDB-PFN, the first foundation model for databases trained purely on synthetic data. Drawing inspiration from Prior-Data Fitted Networks (PFNs), in which synthetic data generated from Structural Causal Models (SCMs) enables reasoning on i.i.d. single tables, we construct a novel Relational Prior Generator that creates an infinite stream of random, complex, and diverse database schemas from scratch. By pre-training on a large-scale curriculum of over 2 million synthetic single-table and relational tasks, RDB-PFN learns to adapt to any new database at inference time via in-context learning alone, without gradient updates. Experiments demonstrate that RDB-PFN outperforms both fine-tuned Graph Foundation Models and state-of-the-art Single-Table Foundation Models on real-world benchmarks. Notably, these results are achieved with a deliberately simple model architecture, suggesting that a rigorously defined synthetic generator is all you need for relational reasoning.
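To make the idea of a synthetic relational prior concrete, the sketch below illustrates one possible instantiation: sample a random schema as a tree of tables linked by foreign keys, then populate each table by drawing from a small random linear SCM. This is a minimal illustration under our own assumptions, not the authors' generator; all names (sample_schema, sample_table, sample_database) are hypothetical.

```python
"""Hypothetical sketch of a synthetic relational prior generator."""
import numpy as np

rng = np.random.default_rng(0)

def sample_schema(max_tables=4, max_cols=5):
    """Random schema: every table after the first holds a foreign key to an earlier table."""
    n_tables = int(rng.integers(2, max_tables + 1))
    return [
        {
            "n_cols": int(rng.integers(2, max_cols + 1)),
            "parent": None if t == 0 else int(rng.integers(0, t)),  # FK target table
        }
        for t in range(n_tables)
    ]

def sample_table(n_rows, n_cols):
    """Fill one table from a random linear SCM over a fixed column ordering."""
    X = np.zeros((n_rows, n_cols))
    for c in range(n_cols):
        noise = rng.normal(size=n_rows)
        if c == 0:
            X[:, c] = noise
        else:
            w = rng.normal(size=c)            # random causal weights on earlier columns
            X[:, c] = X[:, :c] @ w + noise
    return X

def sample_database(schema, n_rows=64):
    """Materialise every table plus an integer foreign-key column for each child table."""
    db = {}
    for t, spec in enumerate(schema):
        table = {"features": sample_table(n_rows, spec["n_cols"])}
        if spec["parent"] is not None:
            table["fk"] = rng.integers(0, n_rows, size=n_rows)  # row ids in the parent table
            table["fk_target"] = spec["parent"]
        db[t] = table
    return db

if __name__ == "__main__":
    db = sample_database(sample_schema())
    for t, tbl in db.items():
        link = f" -> table {tbl['fk_target']}" if "fk" in tbl else ""
        print(f"table {t}: features {tbl['features'].shape}{link}")
```

In this toy version, each draw yields a fresh database with a different schema, so a model trained across many such draws must rely on the in-context examples rather than memorised table statistics; the actual prior described in the abstract is presumably far richer (non-linear mechanisms, mixed column types, deeper join graphs).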