
Explainable k-Means and k-Medians Clustering
Michal Moshkovitz · Sanjoy Dasgupta · Cyrus Rashtchian · Nave Frost

Tue Jul 14 10:00 AM -- 10:45 AM & Tue Jul 14 11:00 PM -- 11:45 PM (PDT)

Many clustering algorithms lead to cluster assignments that are hard to explain, partially because they depend on all the features of the data in a complicated way. To improve interpretability, we consider using a small decision tree to partition a data set into clusters, so that clusters can be characterized in a straightforward manner. We study this problem from a theoretical viewpoint, measuring cluster quality by the k-means and k-medians objectives. In terms of negative results, we show that popular top-down decision tree algorithms may lead to clusterings with arbitrarily large cost, and any clustering based on a tree with k leaves must incur an Ω(log k) approximation factor compared to the optimal clustering. On the positive side, for two means/medians, we show that a single threshold cut can achieve a constant factor approximation, and we give nearly-matching lower bounds; for general k > 2, we design an efficient algorithm that leads to an O(k) approximation to the optimal k-medians and an O(k²) approximation to the optimal k-means. Prior to our work, no algorithms were known with provable guarantees independent of dimension and input size.
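To make the "single threshold cut" idea for two clusters concrete, here is a minimal sketch (not the authors' implementation; all names are illustrative) that exhaustively searches axis-aligned threshold cuts and keeps the one with the lowest 2-means cost:

```python
# Illustrative sketch: for k = 2, try every axis-aligned threshold cut
# (one feature, one split value) and return the cut minimizing the
# 2-means cost. This mirrors the abstract's claim that a single
# threshold cut suffices for a constant-factor approximation when k = 2.

def two_means_cost(left, right):
    """Sum of squared distances of each side's points to that side's mean."""
    cost = 0.0
    for part in (left, right):
        if not part:
            return float("inf")  # disallow empty clusters
        dim = len(part[0])
        mean = [sum(p[j] for p in part) / len(part) for j in range(dim)]
        cost += sum(sum((p[j] - mean[j]) ** 2 for j in range(dim))
                    for p in part)
    return cost

def best_threshold_cut(points):
    """Return (feature index, threshold, cost) of the cheapest single cut."""
    dim = len(points[0])
    best = (None, None, float("inf"))
    for j in range(dim):
        values = sorted(set(p[j] for p in points))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for a, b in zip(values, values[1:]):
            t = (a + b) / 2
            left = [p for p in points if p[j] <= t]
            right = [p for p in points if p[j] > t]
            c = two_means_cost(left, right)
            if c < best[2]:
                best = (j, t, c)
    return best

# Two well-separated blobs along feature 0 (toy data, made up here).
data = [(0.0, 0.1), (0.2, -0.1), (0.1, 0.0),
        (5.0, 0.1), (5.2, -0.2), (5.1, 0.05)]
feat, thr, cost = best_threshold_cut(data)
print(feat, thr)  # → 0 2.6
```

The resulting partition is a depth-1 decision tree: one feature comparison explains every cluster assignment, which is the interpretability benefit the abstract describes. The paper's algorithm for general k > 2 builds a deeper tree with k leaves; the brute-force search above is only meant to convey the k = 2 case.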

Author Information

Michal Moshkovitz (University of California San Diego)
Sanjoy Dasgupta (UC San Diego)
Cyrus Rashtchian (UCSD)
Nave Frost (Tel-Aviv University)
