Poster in Workshop: Data-centric Machine Learning Research (DMLR): Datasets for Foundation Models
What Data-Centric AI Can Do For k-means: a Faster, Robust kmeans-d
Parichit Sharma · Hasan Kurban · Mehmet Dalkilic
Data-centric AI (DCAI) is an emerging paradigm that prioritizes the quality, diversity, and representation of data over model architecture and hyperparameter tuning. DCAI emphasizes upstream data operations such as cleaning, balancing, and preprocessing, rather than focusing solely on model selection and optimization. This work aims to push DCAI into the model-building phase itself and to examine whether the downstream benefits can be as significant for a traditional, well-studied algorithm like kmeans. We introduce data-centric kmeans (kmeans-d for short), a novel adaptation of kmeans clustering that achieves significant speedups while preserving algorithmic accuracy. The key innovation is to classify data points as either high expressive (HE), which significantly impact the objective function, or low expressive (LE), which have minimal influence. By categorizing data points as HE or LE, kmeans-d extracts quality signals to improve scalability and reduce computational overhead. Comprehensive experimental evaluation demonstrates substantial performance gains of kmeans-d over existing alternatives. The novelty lies in the pioneering integration of data-centric principles within a fundamental algorithm's iterative core. By viewing kmeans through a data lens, kmeans-d delivers superior efficiency without sacrificing properties like accuracy and convergence, paving the way for infusing data-centricity into other canonical algorithms.