
Workshop: 8th ICML Workshop on Automated Machine Learning (AutoML 2021)

Towards Model Selection using Learning Curve Cross-Validation

Jan N. van Rijn


Cross-validation (CV) methods such as leave-one-out cross-validation, k-fold cross-validation, and Monte-Carlo cross-validation estimate the predictive performance of a learner by repeatedly training it on a large portion of the given data and testing it on the remaining data. These techniques have two drawbacks. First, they can be unnecessarily slow on large datasets. Second, because they provide only point estimates, they give almost no insight into the learning process of the validated algorithm. In this paper, we propose a new approach to validation based on learning curves (LCCV). Instead of creating train-test splits with a large portion of training data, LCCV iteratively increases the number of examples used for training. In the context of model selection, it eliminates models that can be safely dismissed from the candidate pool. We run a large-scale experiment on the 67 datasets from the AutoML benchmark and empirically show that in over 90% of cases LCCV achieves performance similar to that of 10-fold CV (a difference of at most 0.5%), while providing additional insight into the behaviour of a given model. In addition, LCCV reduces runtime by between 20% and over 50% on half of these 67 datasets. LCCV can be incorporated into various AutoML frameworks to speed up the internal evaluation of candidate models. As such, these results are orthogonal to, and can be combined with, other advances in the field of AutoML.
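The idea of growing the training set and discarding hopeless candidates early can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the pruning rule, so the sketch assumes one for illustration, namely that error curves are convex and decreasing, so extending the latest slope gives an optimistic bound on the error at the full training size. The names `lccv_sketch`, `error_at`, `anchors`, and `best_known_error` are hypothetical, and the two learning curves are synthetic.

```python
from typing import Callable, List, Optional, Tuple


def lccv_sketch(
    error_at: Callable[[int], float],
    anchors: List[int],
    best_known_error: float,
) -> Tuple[Optional[float], List[Tuple[int, float]]]:
    """Evaluate one candidate at growing training sizes (hypothetical helper).

    Returns (error at the largest anchor, or None if pruned early,
    and the list of observed (train_size, error) points).
    """
    curve: List[Tuple[int, float]] = []
    final_size = anchors[-1]
    for size in anchors:
        curve.append((size, error_at(size)))
        if len(curve) >= 2 and size < final_size:
            (s0, e0), (s1, e1) = curve[-2], curve[-1]
            slope = (e1 - e0) / (s1 - s0)  # negative while the model still improves
            # ASSUMPTION (not from the abstract): for a convex, decreasing curve,
            # continuing at the latest slope cannot overestimate the final error.
            optimistic = e1 + min(slope, 0.0) * (final_size - s1)
            if optimistic > best_known_error:
                return None, curve  # cannot beat the incumbent: dismiss early
    return curve[-1][1], curve


# Synthetic learning curves standing in for real train/validate runs.
anchors = [64, 128, 256, 512, 1024]
good_final, good_curve = lccv_sketch(lambda n: 0.1 + 5.0 / n, anchors, best_known_error=1.0)
bad_final, bad_curve = lccv_sketch(lambda n: 0.3 + 5.0 / n, anchors, best_known_error=good_final)
```

In this toy run the first candidate is evaluated at every anchor and becomes the incumbent, while the second is dismissed after three anchors because even its optimistic extrapolation stays above the incumbent's error. The partial curve returned for a pruned candidate is the kind of additional insight a single point estimate does not give.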