DMCO: Budget-Aware Co-Optimization of Data Cleaning and AutoML
Abstract
Data cleaning and automated machine learning (AutoML) are both crucial for reliable learning systems, yet they are commonly treated as independent or sequential stages. This separation ignores their strong interaction and leads to inefficient use of limited computational budgets. We propose DMCO, a unified framework that jointly optimizes data cleaning and model construction under a fixed resource budget. DMCO reformulates the traditional two-stage pipeline as a time-sliced process in which data cleaning and AutoML are interleaved and adaptively scheduled. We introduce a gradient-based sampling strategy for data cleaning, with theoretical guarantees for minimizing gradient estimation variance, and integrate it with loss-driven sampling and progressive AutoML fitting to continuously exploit intermediate improvements in data quality. Experiments on six real-world datasets show that DMCO consistently outperforms standalone data cleaning and AutoML baselines on both classification and regression tasks, as measured by F1 score and mean squared error (MSE), respectively. Under limited budgets, DMCO achieves up to 82.19% of the performance of full data cleaning with exhaustive AutoML, while remaining robust across different AutoML frameworks.
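To make the gradient-based sampling idea concrete, the following is a minimal sketch and not DMCO's actual algorithm. It assumes a linear model with squared loss and draws records for cleaning with probability proportional to each record's gradient norm, a standard importance-sampling choice for reducing the variance of a gradient estimate. The function names `per_sample_grad_norms`, `sampling_probs`, and `sample_for_cleaning` are hypothetical and introduced only for illustration.

```python
import math
import random


def per_sample_grad_norms(X, y, w):
    """Per-record gradient norm for the loss 0.5 * (x.w - y)^2.

    The gradient w.r.t. w is (x.w - y) * x, so its norm is |residual| * ||x||.
    The linear model is an assumption made for this sketch.
    """
    norms = []
    for x_i, y_i in zip(X, y):
        residual = sum(a * b for a, b in zip(x_i, w)) - y_i
        x_norm = math.sqrt(sum(a * a for a in x_i))
        norms.append(abs(residual) * x_norm)
    return norms


def sampling_probs(grad_norms):
    """Normalize gradient norms into sampling probabilities."""
    total = sum(grad_norms)
    n = len(grad_norms)
    if total == 0:
        # All gradients vanish: fall back to uniform sampling.
        return [1.0 / n] * n
    return [g / total for g in grad_norms]


def sample_for_cleaning(grad_norms, k, rng):
    """Draw k record indices (with replacement) proportional to gradient norm."""
    probs = sampling_probs(grad_norms)
    return rng.choices(range(len(probs)), weights=probs, k=k)


# Tiny usage example with made-up data.
X = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
y = [1.0, 0.0, 0.0]
w = [1.0, 1.0]
norms = per_sample_grad_norms(X, y, w)
picked = sample_for_cleaning(norms, k=4, rng=random.Random(0))
```

In this toy input the first record is fit exactly, so its gradient norm is zero and it is never selected for cleaning; records with large residuals dominate the sample. Sampling with replacement is used here because it matches the usual importance-sampling setting; a practical cleaner would likely deduplicate the selected indices.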