Paper ID: 783
Title: Matrix Eigen-decomposition via Doubly Stochastic Riemannian Optimization

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes a novel, doubly stochastic Riemannian optimization algorithm for the problem of finding the dominant eigenspace of a symmetric matrix A. The two aspects of stochasticity are: (1) sampling blocks of the matrix A, and (2) sampling columns of the thin orthogonal matrix V used to diagonalize A. While (1) is straightforward, applying (2) correctly and efficiently, especially in the Riemannian optimization setting, is non-trivial, and is developed in detail in the paper. Experiments show that the method indeed scales well computationally without sacrificing precision.

Clarity - Justification:
While the general motivation and introduction were clear, the more technical sections were hard to follow. Many derivations were not explained in detail, and the order of presentation was sometimes confusing. See detailed comments below.

Significance - Justification:
The paper presents a technical innovation in the way it addresses stochastic sampling of Riemannian coordinates. The result is computationally efficient, especially for dense data matrices. In addition, the authors derive a straightforward yet useful optimal sampling scheme. The experimental results show the efficacy of the method; especially promising is the possibility of using this method for matrices too big to fit in memory.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
1. The Riemannian coordinate structure of the orthogonal and Stiefel manifolds was explored by Shalit & Chechik (2014), albeit with the Euclidean metric on the manifold.
2. In Sections 3.2 and 3.3, I was expecting the notation F_m or F_r instead of E_r, following the equation in lines 287-288. Indeed, immediately below that equation E_r is used instead of F_m, which is confusing.
3. Lines 327-329: this requires better explanation. Why is there a tensor product here, and why is it correct?
4. Lines 327-329: I believe that the off-diagonal elements \frac{\alpha}{2} and -\frac{\alpha}{2} g^\top g should switch signs.
5. I found the derivation of C (in particular how it is obtained from W) and equation (7) hard to follow. More explanation (perhaps in the appendix) would be appreciated.
6. Please define H^(t) within the box of Algorithm 1.
7. Lines 413-416: it would help the reader if you could add the expression g_t = (I - XX^\top) A_t Y_t (see the short note after this list).
8. Section 4.2: this section would be more readable if the main results were stated first, and only then the lemmas needed for their proofs.
9. Theorem 4.4: what is the role of \beta_t? If I follow the theorem correctly, if \beta_t is too large then no convergence is guaranteed. Could \alpha_t be made small enough to ensure convergence to the optimum? This is not clear from the statement of the theorem.
10. Is there a reason the authors did not compare with random-projection-type methods, such as those suggested in Halko et al.? Also, the paper could use a comparison with the method of Mitliagkas et al. mentioned below.
11. The paper should cite and discuss two other closely related papers:
Mitliagkas, Ioannis, Constantine Caramanis, and Prateek Jain. "Memory limited, streaming PCA." Advances in Neural Information Processing Systems. 2013.
Boutsidis, Christos, et al. "Online principal components analysis." Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 2015.
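A short note on point 7 (our reading of the quantity, offered as a check rather than as the authors' exact construction): g_t appears to be the sampled Euclidean gradient A_t Y_t projected onto the orthogonal complement of span(X), i.e.

    g_t = (I - X X^\top) A_t Y_t = A_t Y_t - X (X^\top A_t Y_t).

If this reading is correct, the second form is the one worth stating in the paper, since it makes explicit that the n x n projector never needs to be formed.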
minor comments:
line 278: I believe this should be \frac{1}{2}q(q-1) + (n-q)q instead of \frac{1}{2}n(n-1) + (n-q)q. (The tangent space of the Stiefel manifold St(n, q) has dimension nq - \frac{1}{2}q(q+1), which equals \frac{1}{2}q(q-1) + (n-q)q.)
lines 429-430: more precisely, X_{t+1} = V up to a *right multiplication* by a "small" q x q orthogonal matrix.
lines 579-580: previously the authors use q for the rank; switching to p for the rank here is confusing. It would also be preferable to state explicitly that this is the number of eigenvectors sought.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes a double random sampling algorithm to speed up gradient descent of a quadratic function on the Stiefel manifold, in order to compute the eigenvalue decomposition (EVD). The complexity of each iteration is reduced, and the convergence rate is also analyzed.

Clarity - Justification:
No complex theory if one is familiar with the basics of Riemannian geometry. No surprise in the sampling technique.

Significance - Justification:
Eigenvalue decomposition is frequently used in machine learning.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The proposed algorithm targets optimization on the Stiefel manifold ONLY (actually only the eigenvalue decomposition problem), not general Riemannian manifolds. So "Riemannian optimization" should be replaced with "Stiefel optimization" or "optimization on Stiefel manifolds".

The claim that matrix inversion is completely avoided (line 20) should be changed, as there are indeed matrix inversions in the proposed algorithm, e.g., the inverse of B (lines 337, 351, 376).

The experiments are weak. Other randomized methods for eigenvalue decomposition (e.g., randomized SVD, which reduces to randomized EVD when the matrix is symmetric, and some of the other randomized approaches reviewed in Section 6) should be compared with, even though they are not "designed from the Riemannian optimization perspective" (line 805). Also, from the middle column of Figure 1, although SRGD-EIGS converges faster at the beginning, it seems that the final objective function values are slightly lower than those of RG-EIGS. I would therefore suggest presenting log(obj* - obj) and log(1 - cos) in the middle and right columns of Figures 1 and 2, where the ground-truth objective value obj* is given by EIGS, as long as EIGS does not break down.

Where is (8) (line 489)? Is it the inequality in Lemma 4.2?

Using the F-norm to upper-bound the spectral norm (line 517) is not the optimal way. A better and equally convenient upper bound is ||A||_2^2 \leq ||A||_1 ||A||_\infty; see Golub and Van Loan (1996).
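To make the last point concrete, here is a minimal numerical illustration of the two bounds (a NumPy sketch on a toy symmetric matrix of our own, not the authors' code or data):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((500, 500))
    A = (A + A.T) / 2                         # symmetric, matching the paper's setting

    spectral = np.linalg.norm(A, 2)           # ||A||_2: the quantity being bounded
    frob = np.linalg.norm(A, 'fro')           # Frobenius bound used at line 517
    mixed = np.sqrt(np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))  # Golub & Van Loan bound

    print(spectral, mixed, frob)              # for this dense instance: spectral <= mixed <= frob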
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents a stochastic algorithm for computing eigenvectors of symmetric matrices, based on ideas from Riemannian optimization coupled with an inexpensive sampling strategy. The idea is to randomly generate search directions in the tangent space to the Stiefel manifold of matrices with orthonormal columns. "Double stochasticity" here refers to the fact that two random samples are needed -- one for selecting blocks of the input matrix A to be factored, and the other for selecting dimensions of the variable X. Additional efficiency is obtained via an importance sampling strategy.

Clarity - Justification:
I am not an expert in this field but had no trouble reading this paper.

Significance - Justification:
Eigenvalue/eigenvector computation is a key step in algorithms across computational disciplines, not only in machine learning. It is really interesting to see that more can be squeezed out of this problem even though it has existed for (literally) centuries! The empirical case for this algorithm could be made stronger, but the preliminary experiments in the paper show that this method does indeed make a difference.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
This is a well-written paper with some progress on an important classical problem. The authors carry out theoretical analysis and empirical evaluation, which together make a good case for their method. Localized comments below:
- The paper mentions several times that eigenvalue computation is a "nonconvex quadratic optimization problem." Of course this is true, but the phrase is a bit misleading --- unlike most nonconvex quadratic optimization problems, eigenvalue computation is well understood and not computationally infeasible.
- I am more familiar with "double stochasticity" in terms of matrices --- a matrix is doubly stochastic if its rows and columns each sum to 1. I understand that the probabilities in this paper form a doubly stochastic matrix, although this isn't stated explicitly (and I'm not sure it's necessary). I'll defer to other readers on whether this term is easy to understand in the context of the current paper --- the phrase may mislead readers familiar with optimization algorithms like Sinkhorn's method.
- One class of algorithms that may be worth comparing to and/or citing is LOBPCG: https://en.wikipedia.org/wiki/LOBPCG --- many implementations are available online, and this method can sometimes accelerate convergence (though its guarantees are weaker). A minimal usage sketch appears after this list.
[line 067] manifolds --> manifold
[Algorithm 1] How do you initialize X^0? Does it affect convergence?
- The paper states a few times (e.g., line 96) that matrix inversion is needed for eigenvalue iteration. Why is this? I thought matrix inversion is only needed to compute /small/ eigenvalues rather than large ones.
[line 126] Please define q.
[line 140] "Riemannian"
[line 141] an Euclidean --> a Euclidean
[line 150] Is a retraction the same as the exponential map from differential geometry?
[line 186] Was P_X defined?
[line 225] Is there a systematic way to choose block sizes for the A_ij's if the problem does not naturally have such a structure?
[line 270] This definition of the tangent space has a nice intuitive interpretation -- the first term rotates the vectors in X and the second term moves in the orthogonal complement. It might be worth mentioning this more explicitly.
[equation (7)] This equation is missing a left-hand side (I think it is just an update for X_{.r}).
[section 4] I am not an expert in this type of math and did not check the proofs in this section closely.
[line 489] Perhaps I missed it, but is there an equation (8)?
[line 557] is --> are
[line 624] Is it possible to fix this numerical drift issue by periodically re-orthogonalizing the columns of X, or is this too expensive? (See the sketch after this list.)
[line 757] Pardon my ignorance, but when does the power method diverge?
[section 6] I think this section should appear earlier in the paper. More importantly, it seems this related work should appear in the comparisons in Figure 2. For example, the paper mentions that (Garber et al., 2015) does not have empirical validation --- including it in Figure 2 would fix this issue and help compare the Riemannian method to that paper.
[section 7] This section is not useful and essentially echoes obvious points from the rest of the paper. Please revise so that the conclusion focuses more on high-level take-aways: When should we use your algorithm? At what scale is it most effective? When does it fail? Also, provide more interesting avenues for future research -- I am betting the authors have more problems to share with the research community than "testing on more data."
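On the LOBPCG point above: a minimal usage sketch with the SciPy implementation, on a toy symmetric matrix of our own (not the authors' data or code), just to indicate how little effort such a baseline would take:

    import numpy as np
    from scipy.sparse.linalg import lobpcg

    rng = np.random.default_rng(0)
    n, q = 2000, 10
    A = rng.standard_normal((n, n))
    A = (A + A.T) / 2                    # symmetric test matrix, standing in for the paper's A
    X0 = rng.standard_normal((n, q))     # random initial block of q vectors

    # largest=True requests the q largest eigenpairs, i.e. the dominant eigenspace
    vals, vecs = lobpcg(A, X0, tol=1e-6, maxiter=200, largest=True)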
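On the [line 624] question about numerical drift, the fix I had in mind is roughly the following (a hypothetical sketch, not the authors' algorithm): a thin QR factorization costs O(n q^2), so applying it every few hundred iterations should be negligible next to the cost of the gradient steps when q is small.

    import numpy as np

    def reorthogonalize(X):
        # Thin QR snaps the columns of X back onto the Stiefel manifold;
        # Q has orthonormal columns spanning the same subspace as X.
        Q, _ = np.linalg.qr(X)
        return Q

    # inside the iteration loop (sketch):
    #   if t % 200 == 0:
    #       X = reorthogonalize(X)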
=====