Keywords: [ Statistical Learning Theory ]
Given a training data set, we develop a method to obtain a representative subset of training data meant for classification. Our focus is on subsets that are particularly suitable for training linear classification models. Our method is to reduce data sets in a manner that essentially preserves the convex hulls of the various classes of the data set. We develop scalable randomized algorithms for convex hull approximation, and obtain rigorous guarantees for models trained on subsets of linearly separable data sets. We empirically demonstrate the effectiveness of our approach for both training and data augmentation on the MNIST data set.