Paper ID: 533 Title: A Simple and Provable Algorithm for Sparse Diagonal CCA

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): This paper proposes a stochastic optimization algorithm for the sparse CCA problem. The algorithm solves sparse CCA with complexity linear in the ambient dimension, and the authors provide a theorem proving that its solution is approximately optimal for the low-rank CCA objective.

Clarity - Justification: The main part of the paper is readable. I have not carefully checked the theory in Section 2.3, but the theoretical results look reasonable.

Significance - Justification: The proposed algorithm is fast and more accurate, but the circle of sparse CCA researchers may not be large. I therefore consider this work an incremental advance from the perspective of the machine learning field as a whole.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): The proposed algorithm is a stochastic optimization algorithm, so it is easy to understand why it is fast. The convergence result helps support the effectiveness of the algorithm, but the required iteration number is generally hard to attain when an accurate solution (small epsilon) is needed. I suggest the authors provide more clarification on this point.

Some expressions in the paper seem a little strange. E.g., in line 166, what is "the latter"? Also in this paragraph, "for any allowable time window" should be revised.

The contribution section may be too long. A brief list of contributions would leave a stronger impression on readers and help them grasp the ideas more easily.

The presented parallelization strategy may be somewhat trivial. Any algorithm for a non-convex problem can proceed this way: run the algorithm under different initializations in parallel and pick the best result as the final solution. Thus this may not be a contribution of this work.
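For concreteness, the restart-and-pick-best pattern the reviewer describes can be sketched generically as follows. This is not the paper's algorithm: the solver and objective below are toy placeholders standing in for any non-convex method.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def solve_from_init(seed):
    """Hypothetical non-convex solver: here it simply evaluates a
    toy quadratic objective at a random unit-norm starting point."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(5)
    x /= np.linalg.norm(x)                 # project onto the unit sphere
    objective = float(x @ (np.arange(1.0, 6.0) * x))  # toy objective
    return objective, x

# Run independent restarts in parallel and keep the best run.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(solve_from_init, range(8)))
best_obj, best_x = max(results, key=lambda t: t[0])
```

As the reviewer notes, nothing in this pattern is specific to the proposed method: the restarts are embarrassingly parallel for any initialization-sensitive solver.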
Since the proposed algorithm introduces another parameter (the rank), how should it be tuned in experiments? Is the performance sensitive to it? Please clarify this point.

Besides PMD, many other methods have been proposed for sparse PCA/CCA tasks. It would be better to compare against more of them to make the advantages of the proposed method more convincing.

In line 766, "algorith" should be "algorithm".

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): This paper proposes an approximate algorithm for solving the sparse (unnormalized) CCA problem. Despite the NP-hardness of the problem, the authors propose an approximate algorithm that can tackle large-scale data and be easily parallelized. The algorithm relies on a low-rank decomposition of the matrix, and the rank of this decomposition parametrizes the trade-off between speed and accuracy. The proposed method is then applied to two real-life datasets (breast cancer and brain imaging) with favorable results.

Clarity - Justification: As formulated in Eq 2, the problem is a sparse SVD problem and, as such, is slightly different from the sparse CCA problem. This choice is explained and backed up by relevant references. However, in my opinion, the link with sparse SVD should be emphasized (as Eq 2 is more general and not specific to CCA models).

The notation is counter-intuitive and does not follow the classical choice of the SVD literature. At first, I was confused by the use of x and y for the canonical vectors. Usually, x and y denote the theoretical variables (whose observations x_i and y_i are gathered in the matrices X and Y), while u and v are preferred for the canonical vectors.

Significance - Justification: As previously said, the proposed method is of great interest.
However, the impact of the proposed method could be broadened if it were presented as a sparse SVD algorithm (of which sparse CCA is an instance). Other problems, such as matrix completion, could then be tackled. In that spirit, I believe the following papers could be referenced (as they relate to sparse SVD):
- "Biclustering via Sparse Singular Value Decomposition" by Mihee Lee et al.
- "A Sparse SVD Method for High-dimensional Data" by Dan Yang et al.
- "A DC Programming Approach for Sparse Eigenvalue Problem" by Mamadou Thiao et al.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): The method proposed for solving the sparse CCA problem is very interesting (and novel to the best of my knowledge). However, it is a pity that the paper focuses only on CCA, as the method could be used in the more general setup of sparse SVD (and then applied to a broad range of problems). The convergence bound is quite interesting, but I would also have liked to see empirical experiments (on controlled toy data) analyzing the behaviour of the method with respect to changes in r and T.

Particular comments:
- In the abstract, what does "ambient dimension" refer to? The space of the original inputs, or the lower-dimensional space of the projected inputs?
- The change of notation (for the constraints) between Eq 3 and Eq 4 is slightly confusing.
- The sentence "Theorem 1 establishes a trade-off between ... but the former depends exponentially in r" is also confusing. Should the sentence not be the opposite, since the quality of the approximation decreases when r decreases?
- It may be obvious, but I was wondering whether the roles of x and y could be swapped in the algorithm (using the transpose of A), and whether the bounds would be the same. (My guess is that it does not change anything, but this could be addressed in the paper.)
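The last question has a simple linear-algebra core: transposing A swaps the roles of the left and right singular vectors while leaving the singular values unchanged, which a quick numerical check illustrates (the matrix below is a random placeholder, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))

# SVD of A and of its transpose.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U2, s2, Vt2 = np.linalg.svd(A.T, full_matrices=False)

# Singular values coincide, and the left/right singular vectors
# swap roles (up to a sign flip per singular vector).
assert np.allclose(s, s2)
for k in range(len(s)):
    assert np.allclose(np.abs(U[:, k]), np.abs(Vt2[k, :]))
    assert np.allclose(np.abs(Vt[k, :]), np.abs(U2[:, k]))
```

Whether the data-dependent bounds of the paper are likewise symmetric under this swap is a separate question that only the authors can answer.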
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): The authors present a sparse CCA algorithm that has provable data-dependent guarantees, complexity that scales linearly in the input data dimensionality, and precise control over the sparsity of the extracted canonical vectors. The proposed algorithm is shown to outperform the state-of-the-art sparse CCA method of Witten et al.

Clarity - Justification: The paper is very clearly written and easy to follow. It does an excellent job of relating this work to the previous literature, and of presenting the intuition and an overview of the method before getting into technical details.

Significance - Justification: The authors demonstrate that the proposed sparse CCA algorithm outperforms the algorithm of Witten et al. both in terms of the correlation found (i.e., the objective) and in execution time. These results are convincing. The authors then present a second application to brain imaging. Does the proposed algorithm outperform that of Witten et al. for this application? Are there settings where the proposed algorithm performs worse than that of Witten et al.?

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): This is a strong paper, both technically and in its application to data. The paper is also well written, making it accessible to a wide audience.

=====