FairCore: Subgroup-Equitable Coreset Selection with Provable Fairness–Efficiency Tradeoffs
Kaustubh Bukkapatnam ⋅ Siddharth Karuturi ⋅ Soham Batra ⋅ Laksh Patel ⋅ Tanush A Shastry
Abstract
Coreset selection---choosing a small weighted subset of a training dataset that preserves loss curvature---is a cornerstone of resource-efficient machine learning. Existing methods optimise a global approximation objective, systematically under-allocating budget to minority demographic groups. We prove a bias-amplification lower bound: any global-objective coreset of size $m$ incurs per-group excess risk at least $\Omega(\sigma_{j}^{2}\rho_{j}/m)$ for group $j$, where $\rho_{j} = n/n_{j}$ is its imbalance ratio---so minority groups suffer degradation linear in their imbalance. We introduce \textsc{FairCore}, a group-aware coreset algorithm that simultaneously achieves $\epsilon$-global and $\delta$-per-group approximation with a coreset of size $O(kd/\delta^2 \cdot \log(kB/\delta))$, where $k$ is the number of groups and $d$ is the VC dimension of the loss class. We prove this size is minimax optimal in $k$ and $\delta$. Experiments on three Global South multilingual NLP benchmarks---AmericasNLP, MasakhaNER 2.0, and IndicGLUE---show \textsc{FairCore} raises worst-group accuracy by up to $+14.2$ pp over random sampling and $+6.5$ pp over the strongest baseline (GradMatch) at a 10\% coreset budget, with no statistically significant reduction in average accuracy ($p > 0.05$, paired t-test, 5 seeds).
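To make the scaling in the two bounds concrete, the following minimal Python sketch evaluates only the terms inside the $\Omega(\cdot)$ and $O(\cdot)$ expressions, with all hidden constants omitted. It is an illustration of the stated formulas, not the \textsc{FairCore} algorithm itself. The group sizes, the variance value, and the function names are hypothetical; the abstract does not define $B$, so it is treated here as a generic positive parameter, and the natural logarithm is an assumption.

```python
import math


def imbalance_ratios(group_sizes):
    """Return rho_j = n / n_j for every group j, as defined in the abstract."""
    n = sum(group_sizes.values())
    return {group: n / n_j for group, n_j in group_sizes.items()}


def lower_bound_scale(sigma_sq, rho, m):
    """Scaling term sigma_j^2 * rho_j / m of the lower bound (constant omitted)."""
    return sigma_sq * rho / m


def coreset_size_scale(k, d, delta, B):
    """Scaling term k*d/delta^2 * log(k*B/delta) of the size bound (constant omitted).

    B is not defined in the abstract; it is an assumed positive parameter here.
    """
    return k * d / delta**2 * math.log(k * B / delta)


# Hypothetical two-group dataset: 9,000 majority and 1,000 minority examples.
group_sizes = {"majority": 9000, "minority": 1000}
rho = imbalance_ratios(group_sizes)

m = 1000          # coreset size (10% of n = 10,000)
sigma_sq = 1.0    # assumed equal per-group variance, for comparison only

for group in group_sizes:
    scale = lower_bound_scale(sigma_sq, rho[group], m)
    print(f"{group}: rho = {rho[group]:.3f}, lower-bound scale = {scale:.5f}")

# Size-bound scaling for k = 2 groups, VC dimension d = 10, delta = 0.1, B = 1.
print(f"size scale: {coreset_size_scale(2, 10, 0.1, 1.0):.1f}")
```

With equal variances, the ratio of the two lower-bound terms equals the ratio of the imbalance ratios (9 in this example), which is the linear dependence on imbalance that the abstract describes.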