Paper ID: 440
Title: Low-rank Solutions of Linear Matrix Equations via Procrustes Flow

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors of this paper establish convergence and sample complexity bounds for a two-step non-convex algorithm for finding a low-rank matrix satisfying linear matrix equations. The proposed algorithm first carries out a small number of projected gradient steps. The projection step involves an SVD, which is potentially expensive, but the authors prove that this gives a reasonably accurate initialization. The algorithm then performs inexpensive gradient steps. The projected gradient algorithm has been studied extensively; in particular, the current proof critically relies on a result of Oymak et al. (2015) (see Lemma 5.2). On the other hand, the convergence of (stochastic) gradient algorithms has recently been studied by Candes et al. (2015), Zheng and Lafferty (2015), Sa et al. (2015), and others. In particular, the current paper uses the regularity-condition framework developed by Candes et al. (2015) (see Lemma 5.6). The list of related work (Section 4) is quite extensive.

Clarity - Justification:
The paper is written very clearly. It is organized well and the main results are presented concisely. The related work section mentions Meka et al. (2009), "Guaranteed rank minimization via singular value projection", but does not elaborate on what the strength of the current approach is compared to pure SVP; the RIP requirement seems basically the same.

Significance - Justification:
I see two points that distinguish this work from previous studies. The first is the use of the restricted isometry property. This is a well-studied technique, but as far as I know it has not been used in the context of solving low-rank SDPs via a factorization of the form M = UV^T, and it makes the analysis more transparent compared to previous work. Indeed, this seems to account for the improved sample complexity compared to Zheng and Lafferty. Second, this paper combines the projected gradient algorithm (SVP) with gradient steps on the factors. The authors should better clarify why both steps are necessary; see below.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
- Wouldn't the non-PSD case follow easily from the PSD case by writing M as the off-diagonal block of the lifted PSD matrix [U; V][U^T, V^T]? It seems that the objective (2.4) has an additional term in the non-PSD case; I can see its point intuitively, but I would like to know why the proof for the PSD case does not apply to the non-PSD case.
- Another question about the non-PSD case: the distance of the form (3.6) only seems to account for rotational invariance, but as the authors comment around line 200, in general there is an ambiguity up to an invertible matrix that applies to U and inversely to V. For example, if I take a solution (U, V) and transform it as (UP, VP^{-T}), the distance (3.6) will be very different. Is this a meaningful distance? Or does the additional term in (2.4) resolve this?
- Given that Meka et al. obtained guarantees for the first part of the proposed algorithm (singular value projection) with a similar RIP requirement, it was not clear to me what the additional gradient steps add in terms of the overall statistics-computation tradeoff.
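To make this last question concrete, here is a minimal Python/NumPy sketch of the two-stage scheme as I understand it, for the PSD case. The representation of A by explicit dense sensing matrices A_i (with A(M)_i = <A_i, M>), the function names, the crude rank-r PSD projection, and the step sizes mu and alpha are my own illustrative choices, not necessarily the paper's exact procedure.

    import numpy as np

    def svp_init(A_mats, b, r, mu, T0):
        # Stage 1: a few projected gradient (SVP) steps on the full n x n variable.
        # Each step needs an eigendecomposition, which is the expensive part.
        n = A_mats[0].shape[0]
        M = np.zeros((n, n))
        for _ in range(T0):
            residual = np.array([np.sum(Ai * M) for Ai in A_mats]) - b
            M = M - mu * sum(ri * Ai for ri, Ai in zip(residual, A_mats))
            M = (M + M.T) / 2                       # keep the iterate symmetric
            w, V = np.linalg.eigh(M)
            top = np.argsort(w)[::-1][:r]
            M = (V[:, top] * np.clip(w[top], 0.0, None)) @ V[:, top].T  # crude rank-r PSD projection
        w, V = np.linalg.eigh((M + M.T) / 2)
        top = np.argsort(w)[::-1][:r]
        return V[:, top] * np.sqrt(np.clip(w[top], 0.0, None))  # U0 with U0 @ U0.T close to M

    def procrustes_flow_psd(A_mats, b, r, mu, T0, alpha, T):
        # Stage 2: plain gradient descent on the factor U of
        # f(U) = (1/4) * ||A(U U^T) - b||_2^2; no decompositions are needed here.
        U = svp_init(A_mats, b, r, mu, T0)
        for _ in range(T):
            residual = np.array([np.sum(Ai * (U @ U.T)) for Ai in A_mats]) - b
            grad = 0.5 * sum(ri * (Ai + Ai.T) for ri, Ai in zip(residual, A_mats)) @ U
            U = U - alpha * grad
        return U

Written this way, a Stage-2 iteration and a Stage-1 iteration differ essentially by the eigendecomposition; quantifying that difference is exactly what I am asking for in the last comment above.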
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper considers the problem of finding a low-rank matrix M which satisfies a set of linear equations, subject to the condition that the operator defining the equations satisfies an appropriate RIP property. The main contribution is combining a warm start obtained by projected gradient descent (which was known before, but requires performing a rank-r SVD) with gradient descent on a non-convex objective which in principle involves only matrix-matrix multiplications. The non-convex gradient descent maintains matrices U, V and is analyzed by decomposing M = X \Sigma Y^T and defining a potential function which measures the "closeness" of X \Sigma^{1/2} to U and of Y \Sigma^{1/2} to V, "up to rotations". The necessary decrease of the potential function is quantified by what is essentially "strong convexity in the direction of the ground truth" (Def. 5.5).

Clarity - Justification:
The assumptions are cleanly stated, and the writeup is easy to follow. The proofs are very natural, using the fact that an RIP operator behaves like the identity on low-rank matrices, and the proof sketches are also easy to follow. The full proofs in the appendix are clearly written too.

Significance - Justification:
From a mathematical point of view, the paper is pleasurable to read; however, it follows reasonably closely the choice of potential function, and the arguments for why it decreases, of Candes et al. (2015). The novelties come about when using the RIP properties, which are interesting. My main worry, however, is that I am just not convinced of the cases in which this algorithm is preferable to just running the projected-gradient-descent-based algorithm used for initialization. (The authors spend a paragraph, starting at line 420, on this, but the discussion there is qualitative and vague rather than quantitative.) What I mean by this is that, ignoring condition-number dependencies, a rank-r SVD of a matrix M typically amounts to nnz(M)*r running time. But M is some intermediate iterate in a gradient descent procedure, so this should likely take about n^2*r time. Since the gradient descent updates in the paper involve multiplying an n x n matrix by U_{\tau}, an n x r matrix, and U_{\tau} is again an intermediate iterate over which we have little control, it is difficult to imagine this would take less than n^2*r running time, which is about the same (even accounting for potentially very efficient matrix-vector multiplications). If the main worry is condition-number dependencies, that dependency will be incurred by the few iterations needed to perform the warm start anyway. To remedy this, the authors should either indicate some class of A's for which the algorithm in the paper is provably more efficient, or support it with some experimental results. I am happy to change my decision if convinced otherwise in the rebuttal phase. (In any case, I think this discussion is extremely relevant and should be added to the paper.)

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I mentioned most of the major comments above. The paper is quite cleanly written, but here are some minor things/typos:
-- It would be nice to mention issues of robustness. (In other words, what happens to the guarantee if there is a low-rank M that only approximately satisfies the linear equations?) Similarly, it would be nice to mention some applications of the problem as stated.
(A survey is cited, but a single sentence or two would be nice for readers who are not familiar with the area.) Finally, some justification for the RIP property, again from the point of view of applications, would be good. (And again, even a single sentence would do.)
-- It would be good to mention the worst-case hardness of the problem (whatever is known) for theory audiences.
-- In the "Require" sections of the algorithm, repeat the fact that \mu and \alpha are step sizes. Typically \mu is used as an incoherence parameter, and I was very confused about its meaning until the statement of Theorem 3.2.
-- Line 260: the reference is wrong. It should be (3.2), presumably.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper considers the low-rank recovery problem, in the linear under-determined case, from a non-convex point of view. The non-convexity originates from the simple observation that any rank-r matrix M \in R^{n1×n2} can be written in matrix-factorization form as M = UV^\top, where U and V are "tall" full-rank matrices with r columns and appropriate sizes. Given this observation, one can try to optimize over the factors directly, where the number of parameters to maintain, store, and optimize over is usually much smaller than n1 · n2. The motivation lies in the fact that, by operating in this factored form instead of the original n1 × n2 variable space, one can avoid computationally expensive projections onto low-rank matrices via SVD calculations (in the special case of PSD square matrices, an eigenvalue decomposition is required). The main contribution of the paper is the characterization of the conditions required so that such a gradient descent scheme over the factors is guaranteed to converge to points (U_{\tau}, V_{\tau}) such that U_{\tau} V_{\tau}^\top = M. The authors do so for two cases: (i) when M is rank-r and PSD (and thus M = UU^\top), and (ii) when M is a general rank-r matrix. These conditions include: (i) a proper initialization such that the starting point of the gradient descent over the factors lies within the basin of attraction of the optimal solution (up to rotations), (ii) a proper step-size selection in the gradient descent recursion, and (iii) the number of measurements required so that these steps work (a statistical touch in this optimization problem). This is a very well-motivated problem, and this paper makes some progress on obtaining theoretical convergence bounds in a setting of practical interest. Conceptually, the analysis follows the main steps we currently know for proving that gradient descent works: we need to establish that the gradients at each step are somehow correlated with the direction pointing towards the optimum. The authors focus only on the specific instance of linear matrix sensing. A key ingredient of the proposed theory is the nature of the problem: based on high-dimensional statistical arguments, f(U) := \frac{1}{4} \| b - A(UU^\top)\|_2^2 = \frac{1}{4} \| A(M) - A(UU^\top)\|_2^2 \approx \frac{1}{4} \|XX^\top - UU^\top\|_F^2. It would be interesting to check under which other conditions and problem settings the theory could be used.

Clarity - Justification:
The paper can be split into two parts: the algorithmic part and the theoretical part. (i) In the theoretical part, the authors do a great job of presenting the main ideas. In Section 3, the required tools are presented with clarity, and comments after the main results make the theorems more "digestible".
Furthermore, the proof section provides an adequate overview of the steps followed in the theory, so that the reader gets the main ideas that make the proof go through. (ii) In the algorithmic part, the authors present the steps of the algorithm in Section 2. The paper comes with no experimental results, which might make it difficult for the reader to appreciate the significance of this work.
1. The authors, based on the work of (Oymak et al., 2015), use as an initialization procedure the standard iterative hard thresholding procedure for matrices, with step size 1/m, where m is the number of measurements. While, according to Lemma 5.2, such a selection leads to linear convergence, in practice such an initialization with step size 1/m makes almost no progress towards the solution (1/m could be really small). The reviewer checked the performance of the algorithm in practice and observed that more "clever" step-size selections work better; see also below.
2. How does the initialization relate to that of Zheng and Lafferty, when only one step is taken (and the initial point in projected gradient descent is the zero point)? While the reviewer understands that there are significant differences from (Zheng and Lafferty, 2015), it would be pedagogical to also describe the similarities between the algorithms.
3. As the authors mention, knowing T_0 a priori might not be practical, since κ is unknown. To this end, Lemma 3.4 proposes a practical rule to check whether the initial condition is satisfied. How many iterations are required in practice for this to happen? While this question depends on the problem at hand, it would be informative to know whether one can still achieve the desiderata with a small number of initialization iterations. While there is a lack of space to include such experimental results in the main text, the authors could include some in the Appendix.
All of the above weaken the justification of the premise that gradient descent on the factors could potentially lead to a faster algorithm, or that such an algorithm is faster than, e.g., that of Zheng and Lafferty.

Significance - Justification:
The paper makes a significant contribution in the area of low-rank matrix factorization problems. While this is not the first paper that proves convergence for such low-rank matrix problems from a non-convex perspective, the paper differs from the state of the art in ways that carry a lot of potential. As the authors clearly state in Section 4, most existing works on such problems focus on the PSD case; characteristic examples are the works of (Chen and Wainwright, 2015) and (Bhojanapalli et al., 2015). This is one of the first papers to consider the rectangular-matrix case and should trigger more results on this matter. (i) To the best of my knowledge, the work of (Zhao et al., 2015) considers non-square cases and has the low-rank recovery problem as a special case (as well as matrix completion). How does this work compare with that?

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
(For the next comments, the reviewer has performed experiments; unfortunately, CMT does not provide the option to upload a .pdf to show some plots.) What follows is not in order of importance, but in serial order (from reading the paper top to bottom). The optimal point is denoted as X^\star \in R^{n1×n2}. For all projected gradient descent runs, we used step size α = 1, which is much more aggressive than α = 1/m; a minimal sketch of this comparison is given after the comment list below.
1. Page 2, lines 136-138: The authors should also refer to the works "Normalized Iterative Hard Thresholding for Matrix Completion", "Matrix Recipes for Hard Thresholding Methods", and "ADMiRA: Atomic Decomposition for Minimum Rank Approximation". All of these approaches are extensions of Meka et al., 2009 and are shown to perform much better in practice. This in turn means that, after approximately the same computational effort as Meka et al., 2009, one gets to a better initial point (this has been observed in practice; figure omitted). Is the step size α = 1/m in the initialization part important for the theory to work?
2. Page 2, lines 202-203: If the additional term is removed, how does the quality of the solution change? Is the additional term important? Is the additional term useful only in theory? What property of the additional term makes it suitable to be used? In practice, it seems that both variants work (figure omitted). Please elaborate.
3. Page 3, line 285: change eq. (3.5) to eq. (3.2).
4. Page 3, line 292: (this comment holds for both the PSD and the non-square case) In Theorem 3.2, we assume M to be exactly rank-r. Moreover, PF assumes we know r exactly. Can the authors discuss, or just comment on, the cases where: (i) we overshoot r, i.e., we use an estimate of r such that σ_r(M) = 0, and (ii) we undershoot r, i.e., we use an r such that σ_{r+1}(M) > 0. In particular, the latter corresponds to the case where an additional error term needs to be "paid" for neglecting the spectral tail of M.
5. Page 4, line 392: "then Theorem 3.3 will hold..." - please correct.
6. Page 4, line 395: this comment focuses on the practical rule of Lemma 3.4 (figure omitted). The reviewer conducted the following experiment: we consider the PSD case and test two settings: (i) PF knows the condition number, and thus T_0 = Ω(log(\sqrt{r}κ)) can be used; (ii) PF uses the initialization rule in Lemma 3.4. The results show that, in the former case, knowing κ leads to a small number of projected gradient descent iterations; this is favorable, since one might only be able to afford a few such steps. However, this also leads to an initial point that is farther away from X^\star (gray curve in the outer plot). On the other hand, in the more realistic scenario where κ is unknown, Lemma 3.4 leads to running projected gradient descent for a significant amount of time (proportional to the overall time required for convergence). Is there a characterization of how many iterations are required for the rule in Lemma 3.4 to be satisfied?
7. Page 4, lines 421-424 and Page 5, lines 463-465: Please cite as suggested above.
8. Page 5, lines 474-489: In this paragraph, it would be helpful to also add a comparison with the work "Guaranteed matrix completion via non-convex factorization".
9. Page 5, lines 533-558: However, both (Chen and Wainwright, 2015) and (Bhojanapalli et al., 2015) consider more general problems than the one in this paper: in the former, the authors solve problems that are not even convex in their original form and rely on new conditions that must be checked for every problem, while the latter considers problems that are smooth (and possibly strongly convex), of which low-rank recovery from linear measurements is a special case. Moreover, in the former, some projections can be accommodated. These points could be further highlighted in the paper.
10. Page 6, line 582: Remove one "can".
11. Page 6, lines 610-613: Rephrase these two sentences.
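For concreteness, here is a minimal, self-contained Python/NumPy sketch of the kind of step-size comparison referred to above. The Gaussian sensing model with 1/sqrt(m)-normalized A_i is one common RIP convention and an assumption on my part; the sizes, seed, and iteration counts are purely illustrative, not those of my actual runs.

    import numpy as np

    rng = np.random.default_rng(0)
    n, r, m = 50, 2, 1500                      # illustrative problem sizes
    X_star = rng.standard_normal((n, r))
    M_star = X_star @ X_star.T                 # ground truth, PSD and rank r
    # A(M)_i = <A_i, M>, with the sensing matrices scaled by 1/sqrt(m) (assumed normalization)
    A_mats = [rng.standard_normal((n, n)) / np.sqrt(m) for _ in range(m)]
    b = np.array([np.sum(Ai * M_star) for Ai in A_mats])

    def svp(step, T0):
        # projected gradient descent (SVP) on the full matrix variable, started at zero
        M = np.zeros((n, n))
        for _ in range(T0):
            res = np.array([np.sum(Ai * M) for Ai in A_mats]) - b
            M = M - step * sum(ri * Ai for ri, Ai in zip(res, A_mats))
            U, s, Vt = np.linalg.svd(M, full_matrices=False)
            M = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]
        return M

    for step in (1.0, 1.0 / m):                # aggressive step vs. the 1/m-style step
        M0 = svp(step, T0=3)
        rel_err = np.linalg.norm(M0 - M_star) / np.linalg.norm(M_star)
        print(f"step={step:.2e}  relative error after 3 SVP steps: {rel_err:.3f}")

With this normalization, one should expect the 1/m-type step to make essentially no progress after a few iterations, whereas α = 1 does, which is the behaviour described in comment 1 above.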
Overall: I believe that the paper deserves acceptance, since it addresses a well-motivated problem and solving non-convex problems is a current trend in machine learning and optimization.

=====