

Session: Clustering 1


Wed 11 July 2:00 - 2:20 PDT

Equivalence of Multicategory SVM and Simplex Cone SVM: Fast Computations and Statistical Theory

Guillaume Pouliot

The multicategory SVM (MSVM) of Lee et al. (2004) is a natural generalization of the classical binary support vector machine (SVM). However, its use has been limited by computational difficulties. The simplex-cone SVM (SCSVM) of Mroueh et al. (2012) is a computationally efficient multicategory classifier, but its use has been limited by a seemingly opaque interpretation. We show that MSVM and SCSVM are in fact exactly equivalent and provide a bijection between their tuning parameters. MSVM may then be entertained as both a natural and computationally efficient multicategory extension of SVM. We further provide a Donsker theorem for finite-dimensional kernel MSVM and partially answer the open question concerning the very competitive performance of One-vs-Rest methods against MSVM. Furthermore, we use the derived asymptotic covariance formula to develop an inverse-variance weighted classification rule that improves on the One-vs-Rest approach.
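The inverse-variance weighted rule is only described at a high level in the abstract, but the general recipe can be illustrated. The sketch below combines One-vs-Rest SVM decision scores weighted by an estimate of their variances; the bootstrap variance estimate stands in for the paper's asymptotic covariance formula and is purely an assumption of this illustration, as are all function names.

# Illustration only: a generic inverse-variance weighted One-vs-Rest rule.
# The paper derives its weights from an asymptotic covariance formula for MSVM;
# here the per-class score variances are approximated by a bootstrap, which is an
# assumption of this sketch rather than the authors' construction.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.utils import resample

def ovr_score_statistics(X, y, x_new, n_boot=50, seed=0):
    """Bootstrap mean and variance of each class's One-vs-Rest decision score at x_new."""
    rng = np.random.RandomState(seed)
    scores = []
    for _ in range(n_boot):
        Xb, yb = resample(X, y, random_state=rng)
        clf = OneVsRestClassifier(LinearSVC(C=1.0, max_iter=5000)).fit(Xb, yb)
        scores.append(clf.decision_function(x_new.reshape(1, -1))[0])
    scores = np.asarray(scores)                      # shape (n_boot, n_classes)
    return scores.mean(axis=0), scores.var(axis=0)

X, y = load_iris(return_X_y=True)
mean_score, var_score = ovr_score_statistics(X, y, X[0])
# Inverse-variance weighting: classes whose scores are noisy are down-weighted.
print("predicted class:", np.argmax(mean_score / (var_score + 1e-8)))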

Wed 11 July 2:20 - 2:30 PDT

Quickshift++: Provably Good Initializations for Sample-Based Mean Shift

Heinrich Jiang · Jennifer Jang · Samory Kpotufe

We provide initial seedings to the Quick Shift clustering algorithm, which approximate the locally high-density regions of the data. Such seedings act as more stable and expressive cluster-cores than the singleton modes found by Quick Shift. We establish statistical consistency guarantees for this modification. We then show strong clustering performance on real datasets as well as promising applications to image segmentation.
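As a rough illustration of the idea (not the authors' implementation), the sketch below estimates density with k-nearest-neighbor distances, takes near-locally-maximal points as cluster cores, and hill-climbs the remaining points to those cores. The specific thresholding rule and parameter names are assumptions made for brevity.

# Rough sketch of the idea behind Quickshift++, not the authors' algorithm:
# 1) estimate density from k-nearest-neighbor distances,
# 2) take high-density points as "cluster cores" (beta controls how far below a
#    local maximum a point may fall and still join a core),
# 3) hill-climb every remaining point to its nearest higher-density neighbor.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import NearestNeighbors

def quickshift_pp_sketch(X, k=20, beta=0.3):
    n = len(X)
    dist, idx = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
    density = 1.0 / (dist[:, -1] + 1e-12)          # k-NN density estimate (up to constants)

    # Cluster cores: points nearly as dense as the densest of their k neighbors.
    core = density >= (1.0 - beta) * density[idx].max(axis=1)

    # Connected components of the k-NN graph restricted to core points.
    rows = np.repeat(np.arange(n), k)
    cols = idx.ravel()
    keep = core[rows] & core[cols]
    graph = csr_matrix((np.ones(keep.sum()), (rows[keep], cols[keep])), shape=(n, n))
    _, comp = connected_components(graph, directed=False)
    labels = np.where(core, comp, -1)

    # Quick Shift step: points inherit the label of their nearest higher-density
    # neighbor, processed from high to low density so cores propagate outward.
    for i in np.argsort(-density):
        if labels[i] == -1:
            higher = idx[i][density[idx[i]] > density[i]]
            if len(higher) and labels[higher[0]] != -1:
                labels[i] = labels[higher[0]]
    return labels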

Wed 11 July 2:30 - 2:40 PDT

Hierarchical Clustering with Structural Constraints

Evangelos Chatziafratis · Rad Niazadeh · Moses Charikar

Hierarchical clustering is a popular unsupervised data analysis method. For many real-world applications, we would like to exploit prior information about the data that imposes constraints on the clustering hierarchy and is not captured by the set of features available to the algorithm. This gives rise to the problem of hierarchical clustering with structural constraints. Structural constraints pose major challenges for bottom-up approaches like average/single linkage, and even though they can be naturally incorporated into top-down divisive algorithms, no formal guarantees exist on the quality of their output. In this paper, we provide provable approximation guarantees for two simple top-down algorithms, using a recently introduced optimization viewpoint of hierarchical clustering with pairwise similarity information (Dasgupta, 2016). We show how to find good solutions even in the presence of conflicting prior information by formulating a constraint-based regularization of the objective. Furthermore, we explore a variation of this objective for dissimilarity information (Cohen-Addad et al., 2018) and improve upon current techniques. Finally, we demonstrate our approach on a real dataset for the taxonomy application.
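For reference, the optimization viewpoint in which the guarantees are stated is Dasgupta's (2016) cost function: each pair of points pays its similarity times the number of leaves under the pair's lowest common ancestor in the hierarchy, so similar points should be merged as low as possible. A minimal sketch of that objective, with a nested-tuple tree encoding assumed purely for brevity:

# Sketch of Dasgupta's (2016) cost: cost(T) = sum over pairs (i, j) of
# w_ij * (number of leaves under the lowest common ancestor of i and j).
from itertools import combinations

def leaves(tree):
    return [tree] if not isinstance(tree, tuple) else [x for t in tree for x in leaves(t)]

def dasgupta_cost(tree, w):
    """w[(i, j)] is the pairwise similarity for i < j."""
    if not isinstance(tree, tuple):
        return 0.0
    n_leaves = len(leaves(tree))
    children = [leaves(t) for t in tree]
    cost = 0.0
    # Pairs separated for the first time at this node each pay n_leaves.
    for a, b in combinations(range(len(children)), 2):
        for i in children[a]:
            for j in children[b]:
                cost += w.get((min(i, j), max(i, j)), 0.0) * n_leaves
    return cost + sum(dasgupta_cost(t, w) for t in tree)

# Toy example: the similar pair (0, 1) should be merged before the dissimilar point 2.
w = {(0, 1): 1.0, (0, 2): 0.1, (1, 2): 0.1}
print(dasgupta_cost(((0, 1), 2), w))   # lower cost
print(dasgupta_cost(((0, 2), 1), w))   # higher cost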

Wed 11 July 2:40 - 2:50 PDT

K-means clustering using random matrix sparsification

Kaushik Sinha

The k-means clustering algorithm with Lloyd's heuristic is one of the most commonly used tools in data mining and machine learning and performs well in practice. However, it suffers from a high computational cost resulting from the pairwise Euclidean distance computations between data points and cluster centers in each iteration of Lloyd's heuristic. The main contributor to this computational bottleneck is a matrix-vector multiplication step, where the matrix contains all the data points and the vector is a cluster center. In this paper we show that we can randomly sparsify the original data matrix, and that the resulting sparse data matrix significantly speeds up the above-mentioned matrix-vector multiplication step without significantly affecting cluster quality. In particular, we show that the optimal k-means clustering solution of the sparse data matrix, obtained by applying random matrix sparsification, yields an approximately optimal k-means clustering objective on the original data matrix. Our empirical studies on three real-world datasets corroborate our theoretical findings and demonstrate that our proposed sparsification method can indeed achieve satisfactory clustering performance.
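As an illustration of the approach (the paper's exact sampling scheme may differ), the sketch below keeps each entry of the data matrix independently with probability p, rescales the survivors by 1/p so the sparsified matrix is unbiased, and runs Lloyd's k-means on the result:

# Illustrative sketch only, not the paper's exact sampling scheme: keep each entry
# with probability p and rescale by 1/p, then cluster the sparse matrix.  With a
# sparse matrix, the center-distance computations reduce to cheap sparse
# matrix-vector products.
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

def sparsify(X, p=0.3, seed=0):
    rng = np.random.RandomState(seed)
    mask = rng.rand(*X.shape) < p
    return csr_matrix(np.where(mask, X / p, 0.0))

X, y = make_blobs(n_samples=2000, n_features=50, centers=5, random_state=0)
labels_full = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
labels_sparse = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(sparsify(X))

# Cluster quality should remain close despite using roughly 30% of the entries.
print(adjusted_rand_score(labels_full, labels_sparse))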

Wed 11 July 2:50 - 3:00 PDT

Clustering Semi-Random Mixtures of Gaussians

Aravindan Vijayaraghavan · Pranjal Awasthi

Gaussian mixture models (GMMs) are the most widely used statistical model for the k-means clustering problem and form a popular framework for clustering in machine learning and data analysis. In this paper, we propose a natural robust model for k-means clustering that generalizes the Gaussian mixture model and that we believe will be useful in identifying robust algorithms. Our first contribution is a polynomial-time algorithm that provably recovers the ground truth up to a small classification error with high probability, assuming certain separation between the components. Perhaps surprisingly, the algorithm we analyze is the popular Lloyd's algorithm for k-means clustering, which is the method of choice in practice. Our second result complements the upper bound by giving a nearly matching lower bound on the number of misclassified points incurred by any k-means clustering algorithm on the semi-random model.
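Since the algorithm analyzed is plain Lloyd's iterations, a minimal reference implementation is shown below; the random seeding is an assumption made for illustration, while the paper's guarantees hold under its own separation and initialization conditions.

# Minimal reference implementation of Lloyd's iterations for k-means.
import numpy as np

def lloyds(X, k, n_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    centers = X[rng.choice(len(X), k, replace=False)]    # seed with k random data points
    for _ in range(n_iter):
        # Assignment step: nearest center for every point.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # Update step: recompute each center as the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers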