Skip to yearly menu bar Skip to main content


Poster

Massively Parallel k-Means Clustering for Perturbation Resilient Instances

Vincent Cohen-Addad · Vahab Mirrokni · Peilin Zhong

Hall E #1106

Keywords: [ OPT: Large Scale, Parallel and Distributed ] [ MISC: Unsupervised and Semi-supervised Learning ] [ MISC: Scalable Algorithms ] [ Theory ]


Abstract: We consider k-means clustering of n data points in Euclidean space in the Massively Parallel Computation (MPC) model, a computational model which is an abstraction of modern massively parallel computing system such as MapReduce. Recent work provides evidence that getting O(1)-approximate k-means solution for general input points using o(logn) rounds in the MPC model may be impossible under certain conditions [Ghaffari, Kuhn \& Uitto'2019]. However, the real-world data points usually have better structures. One instance of interest is the set of data points which is perturbation resilient [Bilu \& Linial'2010]. In particular, a point set is α-perturbation resilient for k-means if perturbing pairwise distances by multiplicative factors in the range [1,α] does not change the optimum k-means clusters. We bypass the worst case lower bound by considering the perturbation resilient input points and showing o(logn) rounds k-means clustering algorithms for these instances in the MPC model. Specifically, we show a fully scalable (1+ε)-approximate k-means clustering algorithm for O(α)-perturbation resilient instance in the MPC model using O(1) rounds and Oε,d(n1+1/α2+o(1)) total space. If the space per machine is sufficiently larger than k, i.e., at least knΩ(1), we also develop an optimal k-means clustering algorithm for O(α)-perturbation resilient instance in MPC using O(1) rounds and Od(n1+o(1)(n1/α2+k)) total space.

Chat is not available.