

Poster

Sensitivity Sampling for Coreset-Based Data Selection

Kyriakos Axiotis · Vincent Cohen-Addad · Monika Henzinger · Sammy Jerome · Vahab Mirrokni · David Saulpic · David Woodruff · Michael Wunder


Abstract: We focus on data selection and consider the problem of finding the best representative subset of a dataset on which to train a machine learning model. We provide a new data selection approach based on $k$-means clustering and sensitivity sampling. Assuming an embedding representation of the data and that the model loss is Hölder continuous with respect to these embeddings, we prove that our new approach allows us to select a set of "typical" $k + 1/\epsilon^2$ elements whose average loss corresponds to the average loss of the whole dataset, up to a multiplicative $(1\pm\epsilon)$ factor and an additive $\epsilon \lambda \Phi_k$, where $\Phi_k$ represents the $k$-means cost for the input data and $\lambda$ is the Hölder constant. We furthermore demonstrate the performance and scalability of our approach on fine-tuning foundation models and show that it outperforms state-of-the-art methods. We also show that our sampling strategy can be used to define new sampling scores for regression, leading to a new active learning strategy that is comparatively simpler and faster than previous ones such as leverage score sampling.
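The core idea of the abstract can be sketched in a few lines: cluster the embeddings with $k$-means, assign each point a sensitivity score (its share of the clustering cost plus an inverse-cluster-size term), and sample proportionally to that score with importance weights. The sketch below is a minimal NumPy illustration under these assumptions; the function names and the exact sensitivity formula are illustrative, not the paper's implementation.

```python
import numpy as np

def simple_kmeans(X, k, iters=25, seed=0):
    """Plain Lloyd's algorithm (illustrative stand-in for any k-means solver)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster goes empty.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

def sensitivity_sample(X, k, m, seed=0):
    """Select m points with probability proportional to a k-means
    sensitivity proxy, returning indices and importance weights."""
    centers, labels = simple_kmeans(X, k, seed=seed)
    d2 = ((X - centers[labels]) ** 2).sum(axis=1)
    phi = d2.sum()                        # k-means cost Phi_k of this clustering
    sizes = np.bincount(labels, minlength=k)
    # Sensitivity proxy: the point's share of the cost plus inverse cluster size
    # (each point belongs to its own cluster, so sizes[labels] >= 1).
    s = d2 / max(phi, 1e-12) + 1.0 / sizes[labels]
    p = s / s.sum()
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=True, p=p)
    weights = 1.0 / (m * p[idx])          # reweight for an unbiased average loss
    return idx, weights
```

The returned weights make the weighted average loss over the sampled subset an unbiased estimator of the full-data average, which is the quantity the $(1\pm\epsilon)$ guarantee controls.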
