Fast k-means Seeding Under The Manifold Hypothesis
Poojan Shah ⋅ Shashwat Agrawal ⋅ Ragesh Jaiswal
Abstract
We study beyond-worst-case analysis for the $k$-means problem, where the goal is to model typical instances of $k$-means arising in practice. Existing theoretical approaches provide guarantees under certain assumptions on the optimal solutions to $k$-means, making them difficult to validate in practice. We propose the manifold hypothesis, under which data observed in ambient dimension $D$ concentrates around a low-dimensional manifold of intrinsic dimension $d$, as a reasonable assumption for modeling real-world clustering instances. Using techniques from optimum quantization theory, we identify key geometric properties of datasets that obey theoretically predictable scaling laws governed by the quantization exponent $\varepsilon = 2/d$. We show how to exploit these regularities to design a fast seeding method, $\operatorname{Qkmeans}$, which computes $O(\rho^{-2} \log k)$-approximate solutions to the $k$-means problem in time $O(nD) + \widetilde{O}(\varepsilon^{1+\rho}\rho^{-1}k^{1+\gamma})$, where the exponent $\gamma = \varepsilon + \rho$ for an input parameter $\rho < 1$. This yields new runtime-quality tradeoffs. We conduct a large-scale empirical study across various domains to validate our theoretical predictions and algorithm performance, bridging theory and practice for beyond-worst-case data clustering.