Stratified Coreset Selection for Data-Efficient Low-Resource African Language Sentiment Classification
Anubhav Kumar
Abstract
Coreset selection offers a principled approach to reducing annotation costs in low-resource natural language processing, yet whether standard strategies transfer to African languages under realistic class-imbalance conditions has never been tested. We present the first empirical investigation of coreset selection for sentiment classification in African languages, evaluating three strategies (random, uncertainty, and diversity sampling) on two languages spanning the low-to-mid resource spectrum: Swahili (1,810 training examples, heavily class-imbalanced) and Yoruba (8,522 training examples, moderately imbalanced). Using XLM-RoBERTa-base as the backbone, we run 72 experiments (three strategies, two languages, four data fractions from 10% to 100%, and three random seeds) and report macro-F1 as the primary metric throughout, because accuracy is unreliable under class imbalance. A central finding is a silent failure mode: without class-stratified selection, all strategies collapse to majority-class prediction, producing plausible accuracy while macro-F1 falls to $\approx 0.26$, a failure that accuracy alone does not reveal. Stratified selection eliminates this collapse entirely. With stratification enforced, uncertainty sampling on Swahili at 50% data surpasses the full-data random baseline (macro-F1 0.543 vs. 0.524), and 50% of Yoruba training data recovers 92.5% of full-data macro-F1 (0.579 vs. 0.627). Strategy choice interacts with dataset size: uncertainty sampling performs best on the severely low-resource language, whereas random sampling performs best on the mid-resource one. These results yield concrete, resource-aware guidelines for Global South NLP practitioners operating under strict annotation budgets.
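To make the two central ideas concrete, the following is a minimal sketch and not the paper's implementation. It assumes that stratification means allocating the budget to each class in proportion to its size, which the abstract does not specify. The function names (stratified_select, macro_f1, accuracy) and the toy 70/20/10 label distribution are hypothetical and chosen only for illustration. The sketch shows why a majority-class predictor looks acceptable under accuracy yet poor under macro-F1, and how per-class selection keeps every class represented in a small subset.

```python
import random
from collections import defaultdict


def stratified_select(labels, fraction, scores=None, seed=0):
    """Select roughly `fraction` of the indices from each class separately.

    Assumption: proportional per-class allocation with at least one
    example per class. If `scores` is given (e.g. a model uncertainty
    per example), the highest-scoring examples in each class are kept;
    otherwise examples are drawn at random within each class.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    selected = []
    for idxs in by_class.values():
        k = min(len(idxs), max(1, round(fraction * len(idxs))))
        if scores is None:
            chosen = rng.sample(idxs, k)
        else:
            chosen = sorted(idxs, key=lambda i: scores[i], reverse=True)[:k]
        selected.extend(chosen)
    return sorted(selected)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    f1s = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)


# Toy data (hypothetical, not the paper's datasets): 3 classes, 70/20/10.
labels = [0] * 70 + [1] * 20 + [2] * 10

# A collapsed model that always predicts the majority class.
majority_pred = [0] * len(labels)
acc = accuracy(labels, majority_pred)       # 0.70, looks plausible
mf1 = macro_f1(labels, majority_pred)       # about 0.27, exposes the collapse

# A 10% stratified subset still contains every class.
coreset = stratified_select(labels, fraction=0.1, seed=0)
coreset_classes = {labels[i] for i in coreset}
print(acc, round(mf1, 3), len(coreset), coreset_classes)
```

In this toy setting the collapsed predictor scores 0.70 accuracy because the majority class alone accounts for 70% of the examples, while the two minority classes each contribute an F1 of zero to the macro average. Passing a list of per-example uncertainty values as scores would turn the same routine into a stratified variant of uncertainty sampling under the stated assumption.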