Paper ID: 139
Title: Fast Stochastic Algorithms for SVD and PCA: Convergence Properties and Convexity

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This is an interesting, though in some ways speculative, paper that builds on a recent paper on a variance-reduced SGD approach for PCA. PCA, and methods for finding leading PCs from data, has recently received much attention, primarily because traditional algorithms (SVD) are computationally and memory intensive, which can be a problem for large-scale data. Recent years have seen significant effort in analyzing stochastic-approximation-style algorithms for PCA. While classical results show that one should expect convergence asymptotically, obtaining rates of convergence has been an important problem. The VR-PCA algorithm borrows an idea from SVRG (stochastic variance-reduced gradient descent) and combines stochastic update steps with very occasional “full gradient steps”, which in this context correspond to full power-method updates. The main challenge in proving convergence rates comes from the non-convexity of the problem.

Given the advances of the recent VR-PCA algorithm, this paper presents three main contributions. The first is to extend VR-PCA to the case of recovering k > 1 PCs instead of just one. The modification to the algorithm is precisely what one would expect: each step involves an orthogonalization step. The technical challenge, apparently, is that the “variance reduction” is controlled by the proximity of the column spaces from step to step, and not by the matrices themselves (the two notions coincide in the k = 1 setting). The authors also demonstrate that the algorithm performs well if initialized from a random point on which one full power iteration has been performed. This result is more interesting, as it may be of somewhat more general interest. Perhaps the most interesting aspect of this paper is also the most speculative one: the authors consider whether there is a hidden convexity property to the problem (not in the lifted matrix space, where this result is well known but apparently computationally unusable). While the result that is shown is apparently not computationally exploitable (because it holds only very close to the optimal solution), it is nevertheless theoretically intriguing.

Clarity - Justification:
The paper is clearly presented.

Significance - Justification:
The topic is important. There have been many papers on this topic recently, which makes each individual paper's contribution somewhat incremental. That said, the need for fast, lightweight algorithms for PCA-type problems is important.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Overall, this paper seems interesting. It resolves some of the questions left open in the VR-PCA paper. Nevertheless, as explained in more detail in the comments above, it does retain a somewhat speculative feel, particularly for what is ultimately a follow-on paper.
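To make the scheme summarized in Review #1 concrete, here is a minimal sketch of an SVRG-style block PCA update: occasional full power-method passes over the data combined with cheap per-sample updates, each followed by re-orthogonalization. The function name, step size eta, and epoch/inner-loop lengths are illustrative assumptions, not the paper's exact pseudocode.

```python
# Hypothetical sketch of an SVRG-style block PCA update (illustrative only):
# occasional full "power method" passes combined with cheap variance-reduced
# stochastic rank-k updates, each followed by re-orthogonalization.
import numpy as np

def block_vr_pca_sketch(X, k, eta=0.05, epochs=10, m=None, seed=0):
    """X: (n, d) data matrix; returns a (d, k) orthonormal basis estimate."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = n if m is None else m                         # inner-loop length per epoch
    W, _ = np.linalg.qr(rng.standard_normal((d, k)))  # random orthonormal start
    for _ in range(epochs):
        W_tilde = W.copy()
        G_full = X.T @ (X @ W_tilde) / n              # full pass: power-method direction
        for _ in range(m):
            i = rng.integers(n)
            x = X[i:i+1]                              # one sample, shape (1, d)
            # variance-reduced stochastic direction for the Rayleigh quotient
            g = x.T @ (x @ W) - x.T @ (x @ W_tilde) + G_full
            W = W + eta * g
            W, _ = np.linalg.qr(W)                    # orthogonalization step (k > 1)
    return W
```

The QR step after every update is the k > 1 analogue of renormalizing a single vector, which relates to the point Review #1 highlights about controlling the proximity of column spaces rather than of the matrices themselves.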
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper extends VR-PCA, a stochastic variance-reduced gradient method, to the computation of the top k eigenvectors. A comprehensive analysis of the algorithm and related topics is presented.

Clarity - Justification:
This paper is well written and easy to follow, which made it a pleasure to review.

Significance - Justification:
Though the possibility of a top-k extension was discussed in [Shamir, 2015], this paper presents convincing progress on the topic.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
SVD or PCA is clearly a fundamental and interesting problem for the machine learning community. Though [Shamir, 2015] discussed two potential approaches to top-k eigenvector computation, a rigorous analysis was missing. This paper is a nice extension of that work, and the key challenge is handled well. I also appreciate the further discussion of the warm-start and (non)convexity topics. I have no complaints about the theoretical part. As work on a new algorithm tailored to a specific problem, the paper would be stronger if experiments on synthetic data sets were presented to support its analysis.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes a VR-PCA algorithm for finding the k leading singular vectors. It also proves some supplementary results for the case k = 1.

Clarity - Justification:
The paper is well written and an enjoyable read. A reader can go through all arguments in the paper with no difficulty.

Significance - Justification:
This paper investigates an important and well-known problem (the leading eigenvalue problem) that arises in a wide variety of applications.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
This paper proposes a variant of VR-PCA to handle the case where more than one singular vector is needed, and provides its convergence analysis. The paper also includes supplementary analysis for the single-vector case.

In the first part, the paper develops an algorithm akin to the original single-vector VR-PCA. The convergence analysis shows that the rate of convergence is similar to that of the single-vector case. The convergence, however, depends on starting at a point satisfying a certain condition. This weakens the theoretical contribution, because it is not clear whether a given point satisfies that condition, and there is no numerical simulation showing how deviating from this condition would affect the behavior.

The most interesting section, in my view, is the second part of the paper, which gives further analysis for the single-vector case. It shows that one iteration of the power method can give a point that, with higher probability, satisfies the condition needed for the convergence-rate analysis. It would have been interesting for the authors to support this result with numerical experiments. Another interesting theorem shows that, starting from a random initial point, the algorithm can with high probability reach a point satisfying the conditions needed for the main convergence theorem. Finally, the paper shows that in a neighborhood of the global optimum the objective function is strongly convex, which suggests that a better runtime may be possible. This result is rather vague, and in my experience strong convexity in a small neighborhood does not necessarily translate into better practical performance.

A minor typo around paragraph 188: the definition of positive definiteness has a tiny mistake, because for z = 0 we have z^T B z = 0, so the inequality cannot be strict.

Overall, the paper is interesting and of value to the ICML community.
It investigates an important problem (the leading eigenvalue problem) that has many applications in machine learning. The only drawback is the lack of experimental results supporting the theoretical results in this paper.

=====
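To illustrate the warm start discussed in Review #3, the following is a minimal sketch under the assumption that the data is an n-by-d matrix and the iterate is a d-by-k orthonormal matrix: a random Gaussian initialization followed by a single full power iteration and a QR orthonormalization. The function name and defaults are illustrative, not taken from the paper.

```python
# Hypothetical illustration of the warm start: random Gaussian initialization,
# one full power-method pass with the empirical covariance, then QR.
import numpy as np

def power_iteration_warm_start(X, k, seed=0):
    """One full power-method pass from a random start; X is (n, d)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.standard_normal((d, k))   # random initial point
    W = X.T @ (X @ W) / n             # one power iteration with (1/n) X^T X
    W, _ = np.linalg.qr(W)            # orthonormalize the columns
    return W
```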