Timezone: »
Spotlight
Streaming and Distributed Algorithms for Robust Column Subset Selection
Shuli Jiang · Dongyu Li · Irene Mengze Li · Arvind Mahankali · David Woodruff
We give the first single-pass streaming algorithm for Column Subset Selection with respect to the entrywise $\ell_p$-norm with $1 \leq p < 2$. We study the $\ell_p$ norm loss since it is often considered more robust to noise than the standard Frobenius norm. Given an input matrix $A \in \mathbb{R}^{d \times n}$ ($n \gg d$), our algorithm achieves a multiplicative $k^{\frac{1}{p} - \frac{1}{2}}\poly(\log nd)$-approximation to the error with respect to the \textit{best possible column subset} of size $k$. Furthermore, the space complexity of the streaming algorithm is optimal up to a logarithmic factor. Our streaming algorithm also extends naturally to a 1-round distributed protocol with nearly optimal communication cost. A key ingredient in our algorithms is a reduction to column subset selection in the $\ell_{p,2}$-norm, which corresponds to the $p$-norm of the vector of Euclidean norms of each of the columns of $A$. This enables us to leverage strong coreset constructions for the Euclidean norm, which previously had not been applied in this context. We also give the first provable guarantees for greedy column subset selection in the $\ell_{1, 2}$ norm, which can be used as an alternative, practical subroutine in our algorithms. Finally, we show that our algorithms give significant practical advantages on real-world data analysis tasks.
Author Information
Shuli Jiang (Carnegie Mellon University)
Dongyu Li (Carnegie Mellon University)
Irene Mengze Li (Carnegie Mellon University)
Arvind Mahankali (Carnegie Mellon University)
David Woodruff (Carnegie Mellon University)
Related Events (a corresponding poster, oral, or spotlight)
-
2021 Poster: Streaming and Distributed Algorithms for Robust Column Subset Selection »
Fri. Jul 23rd 04:00 -- 06:00 AM Room
More from the Same Authors
-
2022 Poster: Sketching Algorithms and Lower Bounds for Ridge Regression »
Praneeth Kacham · David Woodruff -
2022 Poster: Learning Augmented Binary Search Trees »
Honghao Lin · Tian Luo · David Woodruff -
2022 Poster: Leverage Score Sampling for Tensor Product Matrices in Input Sparsity Time »
David Woodruff · Amir Zandieh -
2022 Spotlight: Sketching Algorithms and Lower Bounds for Ridge Regression »
Praneeth Kacham · David Woodruff -
2022 Spotlight: Learning Augmented Binary Search Trees »
Honghao Lin · Tian Luo · David Woodruff -
2022 Spotlight: Leverage Score Sampling for Tensor Product Matrices in Input Sparsity Time »
David Woodruff · Amir Zandieh -
2022 Poster: Bounding the Width of Neural Networks via Coupled Initialization - A Worst Case Analysis »
Alexander Munteanu · Simon Omlor · Zhao Song · David Woodruff -
2022 Spotlight: Bounding the Width of Neural Networks via Coupled Initialization - A Worst Case Analysis »
Alexander Munteanu · Simon Omlor · Zhao Song · David Woodruff -
2021 Poster: Single Pass Entrywise-Transformed Low Rank Approximation »
Yifei Jiang · Yi Li · Yiming Sun · Jiaxin Wang · David Woodruff -
2021 Poster: Fast Sketching of Polynomial Kernels of Polynomial Degree »
Zhao Song · David Woodruff · Zheng Yu · Lichen Zhang -
2021 Poster: Dimensionality Reduction for the Sum-of-Distances Metric »
Zhili Feng · Praneeth Kacham · David Woodruff -
2021 Spotlight: Fast Sketching of Polynomial Kernels of Polynomial Degree »
Zhao Song · David Woodruff · Zheng Yu · Lichen Zhang -
2021 Spotlight: Single Pass Entrywise-Transformed Low Rank Approximation »
Yifei Jiang · Yi Li · Yiming Sun · Jiaxin Wang · David Woodruff -
2021 Oral: Dimensionality Reduction for the Sum-of-Distances Metric »
Zhili Feng · Praneeth Kacham · David Woodruff -
2021 Poster: In-Database Regression in Input Sparsity Time »
Rajesh Jayaram · Alireza Samadian · David Woodruff · Peng Ye -
2021 Poster: Oblivious Sketching for Logistic Regression »
Alexander Munteanu · Simon Omlor · David Woodruff -
2021 Spotlight: Oblivious Sketching for Logistic Regression »
Alexander Munteanu · Simon Omlor · David Woodruff -
2021 Spotlight: In-Database Regression in Input Sparsity Time »
Rajesh Jayaram · Alireza Samadian · David Woodruff · Peng Ye -
2020 Poster: Input-Sparsity Low Rank Approximation in Schatten Norm »
Yi Li · David Woodruff