Skip to yearly menu bar Skip to main content


Poster

Analyzing DαDα seeding for kk-means

Etienne Bamas · Sai Ganesh Nagarajan · Ola Svensson

Hall C 4-9 #2216

Abstract: One of the most popular clustering algorithms is the celebrated DαDα seeding algorithm (also know as kk-means++ when α=2α=2) by Arthur and Vassilvitskii (2007), who showed that it guarantees in expectation an O(22αlogk)O(22αlogk)-approximate solution to the (kk,αα)-clustering cost (where distances are raised to the power αα) for any α1α1. More recently, Balcan, Dick, and White (2018) observed experimentally that using DαDα seeding with α>2α>2 can lead to a better solution with respect to the standard kk-means objective (i.e. the (k,2)(k,2)-clustering cost). In this paper, we provide a rigorous understanding of this phenomenon. For any α>2α>2, we show that DαDα seeding guarantees in expectation an approximation factor of Oα((σmaxσmin)24/α(gαmin{,logk})2/α)Oα((σmaxσmin)24/α(gαmin{,logk})2/α) with respect to the standard kk-means cost of any underlying clustering; where gαgα is a parameter capturing the concentration of the points in each cluster, σmaxσmax and σminσmin are the maximum and minimum standard deviation of the clusters around their center, and is the number of distinct mixing weights in the underlying clustering (after rounding them to the nearest power of 22). For instance, if the underlying clustering is defined by a mixture of kk Gaussian distributions with equal cluster variance (up to a constant-factor), then our result implies that: (1) if there are a constant number of mixing weights, any constant α>2α>2 yields a constant-factor approximation; (2) if the mixing weights are arbitrary, any constant α>2α>2 yields an O(log2/αk)O(log2/αk)-approximation, and α=Θ(loglogk) yields an O(loglogk)3-approximation. We complement these results by some lower bounds showing that the dependency on gα and σmax/σmin is tight. Finally, we provide an experimental validation of the effects of the aforementioned parameters when using Dα seeding.

Chat is not available.