

Poster

Average Sensitivity of Hierarchical $k$-Median Clustering

Shijie Li · Weiqiang He · Ruobing Bai · Pan Peng

West Exhibition Hall B2-B3 #W-1016
Tue 15 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract: Hierarchical clustering is a widely used method for unsupervised learning with numerous applications. In modern applications, however, the datasets studied are typically large and dynamic. If a hierarchical clustering algorithm is sensitive to small perturbations of the dataset, its usability is greatly reduced. In this paper, we focus on the hierarchical $k$-median clustering problem, which bridges hierarchical and centroid-based clustering while offering theoretical appeal, practical utility, and improved interpretability. We analyze the average sensitivity of algorithms for this problem by measuring the expected change in the output when a random data point is deleted. We propose an efficient algorithm for hierarchical $k$-median clustering and theoretically prove its low average sensitivity and high clustering quality. Additionally, we show that single linkage clustering and a deterministic variant of the CLNSS algorithm exhibit high average sensitivity, making them less stable. Finally, we validate the robustness and effectiveness of our algorithm through experiments.
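For concreteness, the average-sensitivity measure described in the abstract can be stated formally. A standard formulation is sketched below; the specific output distance $d$ between hierarchical clusterings is a modeling choice of the paper and is treated here as a placeholder (for randomized algorithms, one common variant takes an earth mover's distance between the output distributions):

$$
\beta(\mathcal{A}, X) \;=\; \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\Big[\, d\big(\mathcal{A}(X),\ \mathcal{A}(X \setminus \{x_i\})\big) \Big],
$$

where $X = \{x_1, \dots, x_n\}$ is the dataset, the expectation is over the internal randomness of the (possibly randomized) algorithm $\mathcal{A}$, and the sum averages over which single point is deleted.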

Lay Summary: Hierarchical clustering is a popular tool that helps computers find structure in large datasets — like grouping similar customers or organizing search results. But in real-world situations, data can change: a user leaves, a sensor fails, or a few points are updated. If a clustering algorithm changes too much from small tweaks, it becomes unreliable.

Our research looks at a version of clustering called **hierarchical $k$-median**, which is both useful and interpretable. We ask: _How stable are these algorithms when a single data point is removed at random?_ We measure this with a concept called **average sensitivity** — how much the output shifts due to small changes.

We design a new algorithm that produces high-quality clusters while remaining stable, meaning small changes in data don’t lead to big changes in output. We also show that some commonly used methods are much less stable. Our algorithm not only works well in theory but also performs reliably in practice.
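As a rough illustration of how this kind of stability can be estimated empirically, the sketch below deletes one random point at a time and measures how much a clustering changes. This is not the paper's algorithm: the single-linkage routine, the pair-counting output distance, and all function names are stand-ins chosen for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def cluster_labels(points: np.ndarray, k: int) -> np.ndarray:
    """Flat k-clustering cut from a single-linkage dendrogram (stand-in routine)."""
    Z = linkage(points, method="single")
    return fcluster(Z, t=k, criterion="maxclust")


def pair_disagreement(a: np.ndarray, b: np.ndarray) -> float:
    """Fraction of point pairs the two labelings disagree on (1 - Rand index).
    Invariant to relabeling, so cluster IDs need not match across runs."""
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    off_diag = ~np.eye(len(a), dtype=bool)
    return float(np.mean(same_a[off_diag] != same_b[off_diag]))


def average_sensitivity(points: np.ndarray, k: int, trials: int = 50,
                        seed: int = 0) -> float:
    """Monte Carlo estimate of the expected output change when one
    uniformly random point is deleted from the dataset."""
    rng = np.random.default_rng(seed)
    n = len(points)
    base = cluster_labels(points, k)
    total = 0.0
    for _ in range(trials):
        i = rng.integers(n)                # point to delete
        kept = np.delete(np.arange(n), i)  # indices of surviving points
        perturbed = cluster_labels(points[kept], k)
        # Compare the two clusterings restricted to the surviving points.
        total += pair_disagreement(base[kept], perturbed)
    return total / trials


if __name__ == "__main__":
    X = np.random.default_rng(1).normal(size=(200, 2))
    print(f"estimated average sensitivity: {average_sensitivity(X, k=5):.4f}")
```

A stable algorithm keeps this estimate small as the dataset grows; the paper's point is that some common methods (such as single linkage) do not, while its proposed algorithm provably does.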
