A Random Matrix Analysis of Data Stream Clustering: Coping With Limited Memory Resources

Hugo Lebeau · Romain Couillet · Florent Chatelain

Hall E #611

Keywords: [ MISC: Unsupervised and Semi-supervised Learning ] [ APP: Time Series ] [ MISC: Online Learning, Active Learning and Bandits ] [ PM: Spectral Methods ] [ T: Miscellaneous Aspects of Machine Learning ] [ MISC: General Machine Learning Techniques ]

[ Abstract ]
[ Slides [ Poster [ Paper PDF
Wed 20 Jul 3:30 p.m. PDT — 5:30 p.m. PDT
Spotlight presentation: Deep Learning/MISC
Wed 20 Jul 1:30 p.m. PDT — 3 p.m. PDT

Abstract: This article introduces a random matrix framework for the analysis of clustering on high-dimensional data streams, a particularly relevant setting for a more sober processing of large amounts of data with limited memory and energy resources. Assuming data $\mathbf{x}_1, \mathbf{x}_2, \ldots$ arrives as a continuous flow and a small number $L$ of them can be kept in the learning pipeline, one has only access to the diagonal elements of the Gram kernel matrix: $\left[ \mathbf{K}_L \right]_{i, j} = \frac{1}{p} \mathbf{x}_i^\top \mathbf{x}_j \mathbf{1}_{\left\lvert i - j \right\rvert < L}$. Under a large-dimensional data regime, we derive the limiting spectral distribution of the banded kernel matrix $\mathbf{K}_L$ and study its isolated eigenvalues and eigenvectors, which behave in an unfamiliar way. We detail how these results can be used to perform efficient online kernel spectral clustering and provide theoretical performance guarantees. Our findings are empirically confirmed on image clustering tasks. Leveraging on optimality results of spectral methods for clustering, this work offers insights on efficient online clustering techniques for high-dimensional data.

Chat is not available.