

Poster

Deletion-Anticipative Data Acquisition

Rachael Hwee Ling Sim · Jue Fan · Xiao Tian · Bryan Kian Hsiang Low · Patrick Jaillet


Abstract: Supervised data subset selection and active learning have often been used in data acquisition to select a smaller training set and reduce the time and cost of training _machine learning_ (ML) models. These methods assume that the selected training set remains available throughout the deployment of the ML model. This assumption is increasingly being challenged as data owners, enabled by the GDPR's right to erasure, request the deletion of their data. This raises an important question: _During data acquisition of a training set of size $k$, how can a learner proactively maximize the data utility after future unknown deletions?_ We propose that the learner anticipates/estimates the probability that (i) each data owner in the feasible set will independently delete its data, or (ii) some number of deletions out of $k$ occurs, and we justify our proposal with concrete real-world use cases. Then, instead of directly maximizing the data utility function, the learner should maximize the expected or risk-averse utility based on the anticipated probabilities. We further propose how to construct these _deletion-anticipative data selection_ ($\texttt{DADS}$) maximization objectives so that they preserve properties like monotone submodularity and the near-optimality of greedy solutions, how to optimize the objectives efficiently, and we empirically evaluate $\texttt{DADS}$' performance on real-world datasets.
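To make the idea concrete, under scenario (i) the deletion-anticipative objective can be read as $G(S) = \mathbb{E}_{D}\!\left[f(S \setminus D)\right]$, where $f$ is the data utility function and each owner in $S$ lands in the deleted set $D$ independently with its anticipated probability. Below is a minimal sketch (not the paper's implementation) of greedily maximizing a Monte Carlo estimate of $G$ instead of $f$ itself; the function names, the `delete_prob` dictionary, and the toy coverage utility are all illustrative assumptions.

```python
import random

def expected_utility(f, S, delete_prob, n_samples=500, rng=None):
    """Monte Carlo estimate of G(S) = E[f(S \\ D)], where each owner i in S
    independently deletes its data with probability delete_prob[i].
    A fixed default seed gives common random numbers across calls."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(n_samples):
        kept = {i for i in S if rng.random() >= delete_prob[i]}
        total += f(kept)
    return total / n_samples

def greedy_dads(f, candidates, delete_prob, k, n_samples=500):
    """Greedily build a size-k set that maximizes the deletion-anticipative
    expected utility G rather than the raw utility f."""
    S = set()
    for _ in range(k):
        base = expected_utility(f, S, delete_prob, n_samples)
        best_i, best_gain = None, float("-inf")
        for i in candidates - S:
            gain = expected_utility(f, S | {i}, delete_prob, n_samples) - base
            if gain > best_gain:
                best_i, best_gain = i, gain
        S.add(best_i)
    return S

# Toy usage: each owner covers a set of labels; coverage is monotone submodular.
coverage = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c", "d"}, 3: {"a"}}
f = lambda S: len(set().union(*(coverage[i] for i in S)))
delete_prob = {0: 0.6, 1: 0.1, 2: 0.1, 3: 0.1}
print(greedy_dads(f, set(coverage), delete_prob, k=2))
```

In this toy run, owner 0 covers the most labels but is likely to delete, so the deletion-anticipative objective tends to prefer the more reliable owners, which is the behavior the abstract describes. When $f$ is monotone submodular, $G$ here is a nonnegative mixture of monotone submodular functions and thus inherits both properties, which is what lets greedy selection retain its near-optimality guarantee.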
