Skip to yearly menu bar Skip to main content


Poster

Near-Linear Time Approximation Algorithms for k-means with Outliers

Junyu Huang · Qilong Feng · Ziyun Huang · Jinhui Xu · Jianxin Wang


Abstract: The k-means problem is one of the most extensively studied clustering problems with various applications in unsupervised learning. Despite its popularity, the k-means objective is sensitive to outliers. In this paper, we study the k-means with outliers problem, where the goal is to discard up to $z$ outliers and identify a minimum k-means clustering on the remaining data points. Most previous results for this problem have running time dependent on the aspect ratio $\Delta$ ($\Delta$ is the ratio between the maximum pairwise distance and the minimum pairwise distance) to achieve fast approximation. Unfortunately, these dependencies can significantly limit the scalability of the algorithms for handling large-scale datasets. To address the aspect ratio dependency issue, we propose sampling-based algorithms with almost linear running time in the data size. A crucial component of our approach is an algorithm (called Fast-Sampling) which can find inliers that well approximate the optimal clustering centers, without guessing the optimal clustering cost. With this technique, a 4-approximation can be obtained in time $O(\frac{ndk\log\log n}{\epsilon^2})$ with $O(\frac{k}{\epsilon})$ centers opened and $(1+\epsilon)z$ outliers discarded. To reduce the number of centers opened, we propose a center reduction algorithm, which can find most mistakenly discarded inliers back during the sampling process. Based on the proposed algorithm, an $O(\frac{1}{\epsilon})$-approximate solution can be obtained in time $O(\frac{ndk\log \log n}{\epsilon^2} + \frac{dpoly(k)\log\Delta}{\epsilon^2})$ with $(1+\epsilon)z$ outliers discarded and exact k centers opened. Empirical experiments suggest that our proposed sampling-based algorithms outperform the state-of-the-art algorithms for the k-means with outliers problem. On average, the running time and clustering cost are reduced by at least 20\% and 5\%, respectively.

Live content is unavailable. Log in and register to view live content