Abstract:
One of the most popular clustering algorithms is the celebrated $D^\alpha$ seeding algorithm (also known as $k$-means++ when $\alpha=2$) by Arthur and Vassilvitskii (2007), who showed that it guarantees in expectation an $O(2^{2\alpha}\cdot\log k)$-approximate solution to the $(k,\alpha)$-clustering cost (where distances are raised to the power $\alpha$) for any $\alpha\ge 1$. More recently, Balcan, Dick, and White (2018) observed experimentally that using $D^\alpha$ seeding with $\alpha>2$ can lead to a better solution with respect to the standard $k$-means objective (i.e., the $(k,2)$-clustering cost). In this paper, we provide a rigorous understanding of this phenomenon. For any $\alpha>2$, we show that $D^\alpha$ seeding guarantees in expectation an approximation factor of $O_\alpha\big((\sigma_{\max}/\sigma_{\min})^{2-4/\alpha}\cdot(g_\alpha\cdot\min\{\ell,\log k\})^{2/\alpha}\big)$ with respect to the standard $k$-means cost of any underlying clustering, where $g_\alpha$ is a parameter capturing the concentration of the points in each cluster, $\sigma_{\max}$ and $\sigma_{\min}$ are the maximum and minimum standard deviation of the clusters around their center, and $\ell$ is the number of distinct mixing weights in the underlying clustering (after rounding them to the nearest power of $2$). For instance, if the underlying clustering is defined by a mixture of $k$ Gaussian distributions with equal cluster variance (up to a constant factor), then our result implies that: (1) if there are a constant number of mixing weights, any constant $\alpha>2$ yields a constant-factor approximation; (2) if the mixing weights are arbitrary, any constant $\alpha>2$ yields an $O(\log^{2/\alpha} k)$-approximation, and $\alpha=\Theta(\log\log k)$ yields an $O((\log\log k)^3)$-approximation. We complement these results with lower bounds showing that the dependency on $g_\alpha$ and $\sigma_{\max}/\sigma_{\min}$ is tight. Finally, we provide an experimental validation of the effects of the aforementioned parameters when using $D^\alpha$ seeding.
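For readers unfamiliar with the procedure, the following is a minimal NumPy sketch of $D^\alpha$ seeding as the abstract describes it: the first center is chosen uniformly at random, and each subsequent center is sampled with probability proportional to the point's distance to its nearest chosen center, raised to the power $\alpha$. The function name and interface are illustrative, not taken from the paper; setting $\alpha=2$ recovers classical $k$-means++ seeding.

```python
import numpy as np

def d_alpha_seeding(X, k, alpha=2.0, rng=None):
    """Illustrative sketch of D^alpha seeding (alpha=2 gives k-means++).

    X: (n, d) array of points; k: number of centers; alpha >= 1.
    Returns a (k, d) array of chosen centers.
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    # First center: uniform at random over the points.
    centers = [X[rng.integers(n)]]
    # Squared distance of each point to its nearest chosen center,
    # maintained incrementally as centers are added.
    d2 = np.sum((X - centers[0]) ** 2, axis=1)
    for _ in range(k - 1):
        # D^alpha weights: (Euclidean distance)^alpha = (d^2)^(alpha/2).
        w = d2 ** (alpha / 2)
        probs = w / w.sum()
        idx = rng.choice(n, p=probs)
        centers.append(X[idx])
        # Update nearest-center distances with the new center.
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))
    return np.array(centers)
```

Larger $\alpha$ concentrates the sampling on far-away points, which is the mechanism behind the experimental observation of Balcan, Dick, and White (2018) that $\alpha>2$ can improve the resulting $k$-means cost.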