Explainable k-Means and k-Medians Clustering

Michal Moshkovitz · Sanjoy Dasgupta · Cyrus Rashtchian · Nave Frost

Keywords: [ Clustering ] [ Learning Theory ] [ Unsupervised and Semi-supervised Learning ] [ Accountability, Transparency and Interpretability ]

[ Abstract ] [ Join Zoom
Please do not share or post zoom links


Many clustering algorithms lead to cluster assignments that are hard to explain, partially because they depend on all the features of the data in a complicated way. To improve interpretability, we consider using a small decision tree to partition a data set into clusters, so that clusters can be characterized in a straightforward manner. We study this problem from a theoretical viewpoint, measuring cluster quality by the k-means and k-medians objectives. In terms of negative results, we show that popular top-down decision tree algorithms may lead to clusterings with arbitrarily large cost, and any clustering based on a tree with k leaves must incur an Omega(log k) approximation factor compared to the optimal clustering. On the positive side, for two means/medians, we show that a single threshold cut can achieve a constant factor approximation, and we give nearly-matching lower bounds; for general k > 2, we design an efficient algorithm that leads to an O(k) approximation to the optimal k-medians and an O(k^2) approximation to the optimal k-means. Prior to our work, no algorithms were known with provable guarantees independent of dimension and input size.

Chat is not available.