Approximation Preserving Coresets
Milind Prabhu ⋅ Chris Schwiegelshohn ⋅ Sudarshan Shyam
Abstract
Clustering in a big data setting is an intensively studied problem, with coresets emerging as one of the important paradigms in this line of work. Given a cost function $\text{cost}(P,S)$ mapping input points $P$ and a solution $S$ to an objective value, a coreset is a typically weighted subset $\Omega\subseteq P$ such that $\text{cost}(\Omega,S)\approx \text{cost}(P,S)$. For example, the Euclidean $k$-means problem, arguably the most widely studied problem in this line of work, admits a coreset of size $\tilde{O}(k\varepsilon^{-2}\min(\sqrt{k},\varepsilon^{-2}))$ points while preserving the $k$-means cost for \emph{any} candidate solution up to a $(1\pm \varepsilon)$ factor [CLSSS NeurIPS 2022]. While this bound is reasonably small, most empirical work on coresets suggest that smaller coreset sizes are sufficient. In this paper, we offer an explanation of this phenomenon. We show that a coreset size of $\tilde{O}(k \varepsilon^{-3})$ is sufficient that retains the approximation guarantee up to a $(1+\varepsilon)$ factor of any approximation algorithm used to compute a solution. These \emph{approximation preserving coresets} have a weaker guarantee than that of strong coresets, which apply to all solutions, while having stronger guarantees than weak coresets which only apply to the optimum solution. Thus, in some sense, worst case solutions inducing large strong coresets are solutions that most reasonable algorithms will not consider. We further extend the notion of approximation preserving coresets to \emph{arbitrary} metrics, showing that the approximation guarantee can be retained up to a factor $4+\varepsilon$ with a coreset of size $\tilde{O}(k\varepsilon^{-2})$. We complement this result by showing that a very small distortion on the approximation factor cannot admit coresets of this size. Our implementation with popular approximation algorithms such as $k$-means++ and local search confirm our theoretical findings also in practice.
Successful Page Load