Paper ID: 1224
Title: Efficient Algorithms for Large-scale Generalized Eigenvector Computation and Canonical Correlation Analysis

Review #1
=====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper combines the power method with a linear system solver for canonical correlation analysis. It claims that the complexity of the algorithm is lower than that of prior work.

Clarity - Justification:
The paper is reasonably easy to follow, although the descriptions are a bit dense.

Significance - Justification:
Although generalized eigenvalue decomposition (EVD) has been useful in discriminant analysis, etc., nowadays its usefulness has been dwindling; clear evidence is the surge of deep learning.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The proposed method is still based on the power method, with the inverse of B computed by solving a linear system (or an equivalent quadratic minimization problem). For the power method, it is well known that convergence is global and the convergence rate is linear, with the rate related to the eigenvalue gap. So I don't understand why the authors can still claim (e.g., in the abstract) that "this is the first such algorithm with global linear convergence." The only trick I can see is to avoid the computation of B^{-1/2} by simply neglecting it. As almost all the procedures are standard, is there any insight into why the proposed method has a lower complexity?

Errors:
1. In line 185, y should be (AB^{-1})^{i-1}AB^{-1/2}, so the subsequent derivation should be modified accordingly.
2. In footnote 1, v should be a nonzero vector.
3. In line 679, the numerator should be squared.
4. In the legends of Figure 1, "3" should be "\theta".
5. Some "S"'s in S_{xy}, etc., should be in bold face.

Minor comments:
1. How are the "FLOPs" measured? Why is recording the wall-clock time not used instead?

=====
Review #2
=====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes a special warm-start procedure to run the QR algorithm (i.e., the power method if k=1) for the generalized eigenvalue problem (GEP). Combined with a linear solver that returns a noisy solution, they analyze the method and show that the run-time is reasonable. The GEP and CCA are interesting variants of standard PCA; while mathematically not that different, they have not seen as much interest as PCA.

Clarity - Justification:
The paper is quite readable and has been well thought out.

Significance - Justification:
I'm not entirely convinced that CCA on big data is a top priority for many data analysts, but CCA and the GEP are fundamental problems and could be the basis for more complicated downstream methods (analogous to using PCA for PCR), and there were no good global results on CCA that the authors knew of. Part of the lack of results could be because the problem may have been considered before as within the domain of scientific computing, which has a different focus in its analysis.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
It's an interesting paper, and I'm overall favorable. The technical portion was deferred to the appendix, and given the reviewers' workloads for ICML, I cannot check the appendix in detail. If I understand correctly, the main innovation is a good warm start (lines 198-204), but this was not really discussed in the paper. I'd like to have seen more discussion of this warm start, since it is so key.
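For concreteness, my reading of the basic scheme is roughly the following (my own Python sketch, not the authors' code; the function name is made up, and the off-the-shelf CG routine is just a stand-in for whatever inner solver the paper actually uses):

    import numpy as np
    from scipy.sparse.linalg import cg   # stand-in inner solver; any linear-system solver would do

    def inexact_gen_power_iteration(A, B, iters=50, seed=0):
        """Top generalized eigenvector of A v = lam * B v, applying B^{-1} only approximately."""
        rng = np.random.default_rng(seed)
        w = rng.standard_normal(A.shape[0])
        w /= np.sqrt(w @ (B @ w))                 # normalize in the B-norm
        for _ in range(iters):
            # Warm start: near convergence B^{-1} A w is close to lam * w, so initialize the
            # inner solve of B y = A w at the Rayleigh-quotient-scaled current iterate.
            lam = w @ (A @ w)                     # Rayleigh quotient, since w^T B w = 1
            y, _ = cg(B, A @ w, x0=lam * w)
            w = y / np.sqrt(y @ (B @ y))          # re-normalize in the B-norm
        return w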
I have a few minor comments below, but my major comments are that the warm start should be discussed more, and that finding the parameter Q for Nesterov's method should be discussed.
- Abstract, line 040: it claims the method is linear in the input size and the number of components k, but line 035 contradicts this (since it is not linear in k). Furthermore, it is unclear in line 035 whether this refers to the number of iterations or to the overall complexity (iterations x cost/iteration).
- Line 073: one "X" is not in bold.
- Line 075: "encapsulates" should not have the "s".
- Line 108: it is not clear yet that this is prohibitive, since computing the inverse square root has the same complexity as a full eigenvalue decomposition. Rather, you should emphasize that you only want a few top eigenvalues (e.g., as in Lanczos or power methods).
- Line 228, citing Spielman and Teng '04: I thought their result only applied to very particular linear systems. How is it relevant here?
- Line 278: isn't a logarithmic factor in k hidden as well (cf. line 035)?
- Line 289: "reduces to our choice in the top-1 setting". I'm not clear on what you mean by this.
- Section 6 was interesting.
- Line 732: "Corelations" is misspelled.
- Section 7.1, PTB: it's not obvious to me how you split this into two datasets. Could you explain a bit more?
- Line 761: "is indicator" should be "is an indicator".
- Line 764: why are these ill-conditioned if n is so much larger than d? How does lambda affect your results, in practice and in theory?
- Line 862: your algorithm only takes fewer computations if you require medium accuracy; for low and high accuracy, it seems that S-AppGrad is better.
- How did you compute FLOPs exactly?
- For Nesterov's method to be effective, it needs a valid estimate of Q; otherwise it reverts to the non-strongly-convex case and converges sub-linearly. How do you find Q? (If you regularize with lambda > 0, then you only need the top eigenvalue, so it is much easier, right? In general, this seems like a major problem.) I think Gonzaga and Karas '13 ("Fine-tuning…", Math Prog v 138) have a Nesterov variant that estimates Q on the fly; this has also been discussed in other papers, beginning probably with Nesterov '07.

=====
Review #3
=====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes a novel algorithm for solving generalized eigenvalue problems that implements the matrix inversion in the power iteration through an approximate linear system solve. They use warm starts to ensure that the approximation error is controlled and that the overall algorithm converges linearly to the true solution.

Clarity - Justification:
The paper is clear and easy to read. However, adding a high-level idea or a proof sketch would be helpful for understanding the results.

Significance - Justification:
The idea is interesting in that it uses approximate solutions in each iteration and controls the error naturally by using warm starts.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper proposes a clever idea: using approximate (noisy) power iterations to reduce the computational burden of matrix inversion. It is well known that the noisy power method converges linearly to within the noise tolerance of the solutions. The novel idea of the algorithm is to use warm starts from previous iterations so that the approximation error naturally decreases as the algorithm iterates.
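As a toy illustration of this point (my own sketch on a synthetic matrix, not the paper's algorithm): if the per-iteration error injected into the power method shrinks over time, as a good warm start makes it do, the iterate converges to the true top eigenvector rather than stalling at a fixed noise floor.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 100
    # Synthetic symmetric matrix with a clear eigenvalue gap (top eigenvalue 1.0, rest <= 0.5).
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    M = Q @ np.diag(np.concatenate(([1.0], 0.5 * rng.random(d - 1)))) @ Q.T
    v_true = Q[:, 0]                                     # top eigenvector by construction

    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)
    for t in range(60):
        noise = 0.01 * 0.6**t * rng.standard_normal(d)   # per-iteration error that shrinks with t
        w = M @ w + noise
        w /= np.linalg.norm(w)
        print(t, 1.0 - abs(v_true @ w))                  # misalignment decays roughly geometrically;
                                                         # with constant noise it would plateau instead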
The paper could be improved by explaining the ideas in a proof sketch; otherwise the reader has to go through the appendix to understand how the algorithm works. The experimental section can also be improved. In the small-dataset setting, I do not observe the linear convergence property in Figure 1. Maybe the authors can overlay the theoretical linear convergence rate as a dashed curve for better comparison with the guarantees. Also, in the large-scale experiment, why does the TCC of CCALin start out so low compared with S-AppGrad? The authors could also compare against a version that solves the matrix inversion exactly in the power iteration.

Some related works that are not mentioned in the paper:
[1] The Noisy Power Method: A Meta Algorithm with Applications
[2] Scale Up Nonlinear Component Analysis with Doubly Stochastic Gradients

=====