Skip to yearly menu bar Skip to main content


Poster
in
Workshop: AI for Science: Scaling in AI for Scientific Discovery

Scaling Up Diffusion and Flow-based XGBoost Models

Jesse Cresswell · Taewoo Kim

Keywords: [ Generative Models ] [ Diffusion ] [ physics ] [ Flow-matching ] [ Engineering ] [ Original Research Track ] [ tabular data ]


Abstract: Novel machine learning methods for tabular data generation are often developed on small datasets which do not match the scale required for scientific applications. We investigate a recent proposal to use XGBoost as the function approximator in diffusion and flow-matching models on tabular data, which proved to be extremely memory intensive, even on tiny datasets. In this work, we conduct a critical analysis of the existing implementation from an engineering perspective, and show that these limitations are not fundamental to the method; with better implementation it can be scaled to datasets 370$\times$ larger than previously used. We also propose algorithmic improvements that can further benefit resource usage and model performance, including multi-output trees which are well-suited to generative modeling. Finally, we present results on large-scale scientific datasets derived from experimental particle physics as part of the Fast Calorimeter Simulation Challenge.

Chat is not available.