Timezone: »

Distributed and Provably Good Seedings for k-Means in Constant Rounds
Olivier Bachem · Mario Lucic · Andreas Krause

Tue Aug 08 01:30 AM -- 05:00 AM (PDT) @ Gallery #37

The k-Means++ algorithm is the state of the art algorithm to solve k-Means clustering problems as the computed clusterings are O(log k) competitive in expectation. However, its seeding step requires k inherently sequential passes through the full data set making it hard to scale to massive data sets. The standard remedy is to use the k-Means|| algorithm which reduces the number of sequential rounds and is thus suitable for a distributed setting. In this paper, we provide a novel analysis of the k-Means|| algorithm that bounds the expected solution quality for any number of rounds and oversampling factors greater than k, the two parameters one needs to choose in practice. In particular, we show that k-Means|| provides provably good clusterings even for a small, constant number of iterations. This theoretical finding explains the common observation that k-Means|| performs extremely well in practice even if the number of rounds is low. We further provide a hard instance that shows that an additive error term as encountered in our analysis is inevitable if less than k-1 rounds are employed.

Author Information

Olivier Bachem (ETH Zurich)
Mario Lucic (ETH Zurich)
Andreas Krause (ETH Zurich)
Andreas Krause

Andreas Krause is a Professor of Computer Science at ETH Zurich, where he leads the Learning & Adaptive Systems Group. He also serves as Academic Co-Director of the Swiss Data Science Center and Chair of the ETH AI Center, and co-founded the ETH spin-off LatticeFlow. Before that he was an Assistant Professor of Computer Science at Caltech. He received his Ph.D. in Computer Science from Carnegie Mellon University (2008) and his Diplom in Computer Science and Mathematics from the Technical University of Munich, Germany (2004). He is a Max Planck Fellow at the Max Planck Institute for Intelligent Systems, an ELLIS Fellow, a Microsoft Research Faculty Fellow and a Kavli Frontiers Fellow of the US National Academy of Sciences. He received the Rössler Prize, ERC Starting Investigator and ERC Consolidator grants, the German Pattern Recognition Award, an NSF CAREER award as well as the ETH Golden Owl teaching award. His research has received awards at several premier conferences and journals, including the ACM SIGKDD Test of Time award 2019 and the ICML Test of Time award 2020. Andreas Krause served as Program Co-Chair for ICML 2018, and currently serves as General Chair for ICML 2023 and as Action Editor for the Journal of Machine Learning Research.

Related Events (a corresponding poster, oral, or spotlight)

More from the Same Authors