Regression-Stratified Sampling for Optimized Algorithm Selection in Time-Constrained Tabular AutoML
Abstract
Selecting a machine-learning (ML) algorithm is an essential step in tabular AutoML training. Finding an optimized algorithm in a search space can be expensive for large tabular datasets, especially under time constraints. In this study, we introduce a novel Regression-Stratified Sampling approach that optimizes algorithm selection by minimizing the distance between the Probability Density Function (PDF) of the target variable(s) in a data subset and that in the full-scale dataset. Additionally, we introduce a PDF Energy metric, based on relative entropy, to identify an optimized ML algorithm from the search space. Our evaluation results demonstrate that the proposed approach selects optimized algorithms from a search space of atomic and ensemble models and outperforms simple random sampling. We also evaluate the PDF Energy metric against Kullback-Leibler (KL) divergence and find that it proves superior for algorithm selection. Furthermore, we validate our approach for ML algorithm selection in an end-to-end scenario across 31 public datasets using 6 tabular AutoML tools. The empirical results indicate that, under time constraints, the proposed method uses Regression-Stratified Sampling efficiently and reliably identifies an optimized ML algorithm for tabular data through the PDF Energy metric.
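The sampling idea can be illustrated with a minimal sketch, which is not the implementation evaluated in this study. It assumes that strata are formed by quantile-binning a single continuous target, and it uses a histogram estimate of KL divergence as a stand-in distance, because the PDF Energy metric is not defined in this abstract. The function names `stratified_sample_by_target` and `kl_divergence`, the bin counts, and the synthetic log-normal target are all hypothetical choices made for illustration.

```python
import numpy as np


def stratified_sample_by_target(y, frac, n_bins=10, seed=0):
    """Return row indices that draw roughly `frac` of each quantile bin of y."""
    rng = np.random.default_rng(seed)
    # Quantile edges give strata of roughly equal size over the target.
    edges = np.quantile(y, np.linspace(0.0, 1.0, n_bins + 1))
    # Using only interior edges keeps bin ids in the range 0..n_bins-1.
    bins = np.digitize(y, edges[1:-1])
    chosen = []
    for b in np.unique(bins):
        idx = np.flatnonzero(bins == b)
        k = max(1, int(round(frac * idx.size)))  # at least one row per stratum
        chosen.append(rng.choice(idx, size=k, replace=False))
    return np.concatenate(chosen)


def kl_divergence(y_full, y_sub, n_bins=30, eps=1e-12):
    """Histogram estimate of KL(full || subset) on shared bin edges."""
    edges = np.histogram_bin_edges(y_full, bins=n_bins)
    p, _ = np.histogram(y_full, bins=edges)
    q, _ = np.histogram(y_sub, bins=edges)
    # Smooth with eps so empty subset bins do not produce log(0).
    p = p / p.sum() + eps
    q = q / q.sum() + eps
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))


# Synthetic, skewed regression target (illustrative data only).
rng = np.random.default_rng(42)
y = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

strat_idx = stratified_sample_by_target(y, frac=0.05)
rand_idx = rng.choice(y.size, size=strat_idx.size, replace=False)

print("KL, stratified subset:", kl_divergence(y, y[strat_idx]))
print("KL, random subset:    ", kl_divergence(y, y[rand_idx]))
```

A lower value indicates that the target distribution of the subset is closer to that of the full dataset. The comparison printed here depends on the seed and the bin counts, so it illustrates the mechanics only and is not evidence for the results reported above.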