ICML Poster Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond

Kyriakos Axiotis · Vincent Cohen-Addad · Monika Henzinger · Sammy Jerome · Vahab Mirrokni · David Saulpic · David Woodruff · Michael Wunder

Hall C 4-9 #915

Abstract: We study the data selection problem, whose aim is to select a small representative subset of data that can be used to efficiently train a machine learning model. We present a new data selection approach based on $k$-means clustering and sensitivity sampling. Assuming access to an embedding representation of the data with respect to which the model loss is Hölder continuous, our approach provably allows selecting a set of "typical" $k + 1/\varepsilon^2$ elements whose average loss corresponds to the average loss of the whole dataset, up to a multiplicative $(1\pm\varepsilon)$ factor and an additive $\varepsilon \lambda \Phi_k$, where $\Phi_k$ represents the $k$-means cost for the input embeddings and $\lambda$ is the Hölder constant. We furthermore demonstrate the performance and scalability of our approach on fine-tuning foundation models and show that it outperforms state-of-the-art methods. We also show how it can be applied on linear regression, leading to a new sampling strategy that surprisingly matches the performance of leverage score sampling, while being conceptually simpler and more scalable.
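
The combination of $k$-means clustering and sensitivity sampling described in the abstract can be illustrated with a minimal sketch. This is not the authors' exact algorithm: the function names `kmeans` and `sensitivity_sample`, the plain Lloyd iterations, and the 50/50 mix between each point's share of the $k$-means cost and a uniform term are all assumptions made for illustration. The sketch samples points with probability that grows with their distance to the assigned center, then reweights them so that the weighted sample loss is an unbiased estimate of the average loss over the full dataset.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm (illustrative; any k-means solver would do).

    Returns the centers, each point's cluster label, and each point's
    squared distance to its assigned center.
    """
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    return centers, labels, d2[np.arange(len(X)), labels]

def sensitivity_sample(X, k, m, seed=0):
    """Sample m indices with clustering-based importance probabilities.

    Assumed scheme: p(x) = 0.5 * cost(x) / Phi_k + 0.5 / n, where cost(x)
    is the squared distance of x to its center and Phi_k is the total
    k-means cost. Weights are 1 / (m * n * p(x)), so the weighted sum of
    per-point losses estimates the dataset's average loss without bias.
    """
    rng = np.random.default_rng(seed)
    _, _, cost = kmeans(X, k, seed=seed)
    n = len(X)
    total = cost.sum()
    if total > 0:
        probs = 0.5 * cost / total + 0.5 / n
    else:
        probs = np.full(n, 1.0 / n)  # degenerate case: all points on centers
    probs = probs / probs.sum()  # guard against floating-point drift
    idx = rng.choice(n, size=m, replace=True, p=probs)
    weights = 1.0 / (m * n * probs[idx])
    return idx, weights
```

In this sketch the sample size `m` plays the role of the $k + 1/\varepsilon^2$ budget from the abstract, and `X` stands for the embedding representation of the data. A caller would compute the model loss only on `X[idx]` and combine the per-point losses using `weights`.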
