Paper ID: 140
Title: Convergence of Stochastic Gradient Descent for PCA
Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): This paper considers stochastic gradient ascent for the (non-convex) problem of finding the leading principal component from data. That is, it seeks to solve the problem max w’E[xx’]w, or equivalently min -w’E[xx’]w, over the set ||w||=1. In particular, the paper is interested in a 1-pass algorithm that looks at each data point once. Stochastic gradient descent (or ascent) naturally fits this setting and also has minimal memory requirements. While several papers have considered convergence rates of this algorithm, apparently all require an eigengap. Indeed, the papers by Hardt & Price as well as Mitliagkas et al. provide a 1-pass algorithm for PCA, but require an eigengap. This paper’s main contribution is an analysis that removes this requirement.
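For concreteness, the 1-pass scheme under discussion can be sketched as a generic Oja-style stochastic gradient ascent step; the constant step size eta and the explicit renormalization below are illustrative choices of mine, not necessarily the paper's exact algorithm:

```python
import numpy as np

def oja_sgd_pca(samples, eta=0.01, seed=0):
    """One-pass stochastic gradient ascent for the top principal
    component: w <- normalize(w + eta * x (x' w)). A sketch of the
    generic Oja-style update, with an assumed constant step size."""
    rng = np.random.default_rng(seed)
    d = samples.shape[1]
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)
    for x in samples:                # each data point is seen exactly once
        w = w + eta * x * (x @ w)    # stochastic gradient of w' xx' w
        w /= np.linalg.norm(w)       # project back onto the unit sphere
    return w
```

With data drawn from N(0, I + vv’), the returned w should align with the leading eigenvector v when an eigengap is present.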
Clarity - Justification: The paper writing is clear. While technical, the authors make a good attempt to explain the key ideas.
Significance - Justification: The paper's main point is removing the eigengap condition in previous works. This is an important contribution.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): The key idea in the proof is to consider the quantity V_T (line 530) and to show that it is negative with some probability. The approach is to prove that its expectation is negative and that it is bounded below with high probability; from this, one can conclude that it must also be bounded above with some probability, as otherwise it could not attain the stated expected value. This avenue of proof presents several issues, in particular, as the authors point out, that the probability of success is very low. This is quite an important issue, despite the fact that this paper addresses a regime where past analyses do not apply. Another issue is the assumption on the size of b, which can also be seen as an assumption on the variance of the stochastic process. Theorems 1 & 2 require that ||At||/||A|| and ||At-A||/||A|| both be bounded by b w.p. 1. If we consider the setting of a Gaussian distribution, we may have some trouble, even in the simplest setting where we do have an eigengap, say, as in the setting of Theorem 2. To explain my confusion, for simplicity, let us take the population covariance to be A = I + vv’, so that ||A|| = 2, whereas At will be a d-dimensional vector sampled according to N(0,A), and hence we will have ||At|| ≈ \sqrt{d}. The authors point this out in the discussion of the numerical rank, where they say that b^2 \geq n_A, though I find that discussion confusing: it seems that the authors are making this argument to show that n_A is not too big, but I read it as showing the opposite, namely that b can be very big. So, for A as above, n_A is about d. If we put b = \sqrt{d} into the bounds, the results are significantly worse than what one would hope for from the batch case: T must be much bigger than in the batch case, when we assume that we do not have a warm start (and hence when p = O(d)).
In other words, it seems that we need to run the algorithm for T ~ b^2 p = d^2 iterations, rather than d iterations. In the warm-start case, the value of b is no longer an important issue, and I do agree with the conclusions the authors draw. In any case, the improved dependence of Theorem 2 on the eigengap compared to prior work is of interest.
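The norm calculation in the Gaussian example above is easy to verify numerically; a minimal sketch (the dimension d = 1000 and the choice v = e_1 are mine, for illustration only):

```python
import numpy as np

# For A = I + vv' with unit v we have ||A|| = 2, while a sample
# x ~ N(0, A) has ||x|| concentrating around sqrt(d), so the ratio
# ||x|| / sqrt(d) stays near 1 as the dimension d grows.
rng = np.random.default_rng(0)
d = 1000
v = np.zeros(d); v[0] = 1.0
x = rng.standard_normal(d) + rng.standard_normal() * v  # x ~ N(0, I + vv')
ratio = np.linalg.norm(x) / np.sqrt(d)                  # close to 1
```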
=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): The submission provides an analysis of the convergence rate of SGD for PCA, without making any assumptions about the presence or magnitude of an eigengap.
Clarity - Justification: Clear and well-written; in particular, Section 3 does a nice job of walking the reader through the weaknesses of Theorem 1 and a proposed solution.
Significance - Justification: The main novelty here is in doing away with the eigengap assumption, so it's therefore a bit incremental, particularly in light of the fact that the convergence rate is much worse without an eigengap assumption than with one. With that said, an eigengap assumption is rather unnatural, and is rarely the sort of thing that one can easily guarantee on a real world dataset. Hence, the no-eigengap analysis is more realistic.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): The proposal to "warm-start" the algorithm via an approximate power method iteration is surprising, and I'm not completely convinced about it. Isn't the result of this warm start procedure awfully similar to the result of performing T_0 SGD iterations starting at w? Do you believe that this is "real", i.e. that this warm start is better, in practice, than simply performing T_0 additional SGD iterations?
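To make the question concrete, here is a minimal sketch of my reading of the two alternatives: one approximate power-method iteration over a batch of T_0 samples, versus T_0 ordinary SGD iterations from the same start. The averaging form, step size, and renormalization are my assumptions, not the paper's exact procedures:

```python
import numpy as np

def warm_start_power(samples, w0):
    """One approximate power-method iteration: apply the empirical
    covariance of the first T_0 samples to w0 and renormalize.
    A sketch of my reading of the proposed warm start."""
    w = np.mean([x * (x @ w0) for x in samples], axis=0)  # (1/T_0) sum x x' w0
    return w / np.linalg.norm(w)

def sgd_steps(samples, w0, eta=0.01):
    """The alternative raised above: T_0 plain SGD iterations
    (Oja-style update with an assumed constant step size)."""
    w = w0.copy()
    for x in samples:
        w = w + eta * x * (x @ w)
        w /= np.linalg.norm(w)
    return w
```

On data from N(0, I + vv’), both starting points end up better aligned with v than w0 was, which is exactly why the comparison seems worth an empirical study.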
=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): The paper is the first to give a convergence result for the SGD method applied to the PCA problem that does not depend on any eigengap property of the underlying covariance matrix. To improve the rate of convergence, the authors propose a warm-start strategy and provide theory supporting their method.
Clarity - Justification: The paper is well-written and it is easy to follow the claims of the paper.
Significance - Justification: Given the success of the SGD method for finding the leading eigenvector of the covariance matrix, the results of this paper are very interesting.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): This paper proves a convergence rate for SGD on a very popular non-convex problem, namely finding the leading eigenvector of the covariance matrix. It is the first paper to provide a convergence-rate analysis without an eigengap assumption. The convergence rate depends on a constant that can be large if one uses random initialization. To improve the rate, the authors propose a warm-start strategy and provide theory supporting their method. Using this strategy, they also derive new bounds under an eigengap assumption.
=====