Paper ID: 543
Title: Quadratic Optimization with Orthogonality Constraints: Explicit Lojasiewicz Exponent and Linear Convergence of Line-Search Methods

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): This paper considers the problem of optimizing a (matrix) quadratic function over the Stiefel manifold, of which principal component analysis is a special case. Previous work has shown that retraction-based line-search gradient methods converge to a critical point for this problem. The main contribution of this work is to explicitly derive the Lojasiewicz exponent, thereby proving the (local) linear convergence rate. Limited experiments are provided to verify the theoretical results.

Clarity - Justification: The paper is largely easy to follow, but some details could be improved. For instance, in all experiments it is claimed that the algorithm converges to the global optimum, but details on how the conditions in Theorem 3 are met are missing. Some of the technical proofs could perhaps be moved to the appendix, and the saved space used to better articulate the difference, or the difficulty, in deriving the error bound in Theorem 2 as compared with the convex case.

Significance - Justification: Optimizing a quadratic function over the Stiefel manifold is certainly an interesting and important problem. The main contribution here is mostly on the theoretical side, reaffirming the effectiveness of retraction-based line-search gradient methods. However, the proof techniques do seem to heavily exploit the matrix structure and hence may not be very useful for other nonconvex problems. A more general result, beyond deriving the exponent for one particular problem, would have made the paper much stronger.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): The authors did a very good job of introducing the problem and the relevant literature. It would be nice to mention that the quadratic problem admits a randomized approximation algorithm, especially considering that the work of (So, 2011) was cited.

The main results in Section 3 seem to have some gaps. First, it is clear from the proof of Proposition 2 that eta depends on X^*, hence the correct statement should be "for all X^* in Xcal, there exists some eta such that the inequality holds". The proof of Theorem 1 is also problematic: the proof in lines 443-450 only shows that "for all X, there exists X^* such that the inequality holds", while Theorem 1 claims that "for all X and X^* (sufficiently close) the inequality holds." Besides, contrary to the claim in line 382, the constant eta is NOT uniform, since it is not uniform in Proposition 2. Due to these flaws, I am not sure whether one can actually use the results in (Schneider and Uschmajew, 2015) to claim a linear convergence rate.

The QR-retraction algorithm is simply projected gradient, which is perhaps worth mentioning explicitly. It is puzzling that different retractions lead to very different slopes of convergence under different settings. In line 480, 1/(2 rho) should be max(1, 1/(2 rho)).

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): This paper presents a proof regarding the convergence rate of a class of first-order Riemannian optimization methods for solving an important class of quadratic optimization problems over the Stiefel manifold. The paper shows that these methods have a global linear convergence rate to a critical point.
For the special case of finding the k smallest eigenvectors of a symmetric n x n matrix, the authors prove that first-order Riemannian optimization converges to the global optimum under certain conditions on the initial point and the eigengap. The main technical tool employed is the Łojasiewicz inequality and the accompanying theory, as developed in previous work cited by the authors.

Clarity - Justification: The introduction and motivation are very well written, and the paper does an excellent job of contextualizing the result. The theorems and lemmas are presented in an easy-to-follow way, and their proofs in the main paper are broken into easily digestible pieces. I did, however, find the full proofs of some of the propositions (esp. 6 and 7) to be more technical and harder to follow.

Significance - Justification: The theoretical result presented in this paper is novel and highly interesting. To the best of my knowledge, there has not, until now, been a result showing linear convergence rates over the Stiefel or orthogonal manifold for non-trivial problems. This result also points to the potential of using the Łojasiewicz inequality for obtaining convergence rates for non-convex objectives or non-convex constraints.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

1. The authors introduce and extensively use the quantity D_\rho(X). I would appreciate some more context or intuition about this quantity. What is the role of \rho? How should it be chosen, and what effect do different choices have?

2. Would the result fail for higher-order polynomials? I have witnessed in the past that first-order Riemannian methods on the Stiefel manifold seem to have linear convergence rates for the cubic problem of 3-mode tensor diagonalization. I would be curious to know whether this could be theoretically proven using the methods in this paper.

3. The result in this paper is of high interest because of the implication of the Łojasiewicz exponent for the global convergence rate. I would appreciate it if the authors could explain this connection, rather than just citing previous papers. This is important because the theorem proved in the paper seems to imply only a local condition, and the connection to the global rate is not immediately obvious.

Minor comments: lines 436-437: the term "equivalently" is misleading here. The result follows, but it is not an equivalence.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): This is an interesting paper showing that a simple gradient-based approach is linearly convergent for the case of quadratic optimization with orthogonality constraints.

Clarity - Justification: The paper is well written, and its arguments are very easy to follow.

Significance - Justification: Due to the importance of the problem of optimizing a quadratic cost function with orthogonality constraints, the convergence analysis is of high value.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): This paper gives an explicit estimate of the exponent in a Lojasiewicz inequality for the set of critical points of the class of quadratic optimization problems with orthogonality constraints (QP-OC). The paper is well written, and its arguments are very easy to follow. The authors start with a complete introduction and overview of the existing techniques.
Then, the paper proves the exponent in a Lojasiewicz inequality for the QP-OC problem, leading to a linear convergence rate result despite the non-convexity of the problem. This paper also proves an interesting theorem on convergence to the global minimum when certain conditions hold. Linear convergence is also observed in the numerical experiments. Overall, the theory of the paper is interesting, and it is applied to a well-known problem with many applications.

===== Review #4 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): The paper considers matrix optimization problems of the form min_{X \in St} tr(X' A X B), where St is the set of m x n matrices with orthonormal columns (X' X = I). For example, principal component analysis (PCA) corresponds to the above with B = I and A the negative of the covariance matrix. The paper shows that for any such problem, a gradient-descent-type method starting from an arbitrary point will converge to a critical point at an asymptotically linear rate. Moreover, in the special case where B = I, if A satisfies an eigengap condition and the initialization point satisfies a mild condition, then there is asymptotic linear convergence to a global minimum (as opposed to just any critical point). The paper concludes with a couple of experiments on synthetic data and MNIST, for the case B = I, demonstrating linear convergence.

Clarity - Justification: The paper is overall well written, and reasonably easy to follow despite the heavy mathematical content.

Significance - Justification: The main result here seems like a potentially useful contribution to parts of the optimization community (in light of previous work, which did not even characterize the asymptotic convergence rate). However, I am a little concerned about its significance to the machine learning community, at which ICML is aimed. Although the problem studied here includes PCA as a special case, the paper doesn't provide any motivation for the more general problem (e.g., an example from machine learning where B is not the identity matrix). Also, the results do not seem to imply anything new for the PCA problem, which has already been studied in more detail in previous literature. A second issue is that the results are asymptotic in nature and do not tell us much about the finite-time behavior of the algorithm, or about how important factors such as the starting point or eigengap conditions affect it. It also does not indicate which type of retraction the algorithm should use, although the experiments indicate that this choice can significantly affect the convergence behavior.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): My main comments are listed above. In terms of soundness, the paper seems fine as far as I can discern, with self-contained and apparently correct proofs (although I have not carefully checked all the details). To summarize, I think this is a well-written and potentially useful contribution to the optimization community, but for a machine learning crowd the paper should have been better motivated. Also, the results are asymptotic in nature, which weakens their applicability.

=====
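
On the connection raised in Review #2 (point 3) between the Łojasiewicz exponent and the linear rate: the standard argument from the Kurdyka-Łojasiewicz literature can be sketched as follows, assuming (in one common normalization) an exponent of 1/2 and a line search with a sufficient-decrease guarantee. The constants \kappa, c, C below are generic and are not taken from the paper.

Suppose that near a critical point $X^*$ the Łojasiewicz gradient inequality holds,
\[
|f(X) - f(X^*)|^{1/2} \le \kappa\, \|\operatorname{grad} f(X)\|,
\]
and that the line search guarantees the sufficient-decrease condition
\[
f(X_k) - f(X_{k+1}) \ge c\, \|\operatorname{grad} f(X_k)\|^2 .
\]
Writing $e_k = f(X_k) - f(X^*)$, the two inequalities combine to give
\[
e_k - e_{k+1} \ge c\, \|\operatorname{grad} f(X_k)\|^2 \ge \frac{c}{\kappa^2}\, e_k ,
\qquad\text{hence}\qquad
e_{k+1} \le \Bigl(1 - \frac{c}{\kappa^2}\Bigr) e_k ,
\]
i.e. Q-linear decay of the function values once the iterates enter the neighborhood where the inequality holds; together with a bound of the form $\|X_{k+1} - X_k\| \le C\, \|\operatorname{grad} f(X_k)\|$ this also yields R-linear convergence of the iterates. The passage from this local statement to a global rate rests on the fact that the whole sequence eventually stays in such a neighborhood of its limit point.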
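
To make concrete the QR-retraction iteration that Review #1 describes as projected gradient, here is a minimal NumPy sketch of Riemannian gradient descent for min tr(X' A X B) over the Stiefel manifold. It assumes symmetric A and B and a fixed step size in place of the paper's line search; the instance, function names, and constants are illustrative only and are not taken from the paper.

import numpy as np

def qr_retraction(Y):
    # Map a full-column-rank matrix back to the Stiefel manifold via its QR factor.
    Q, R = np.linalg.qr(Y)
    # Flip column signs so that R has a nonnegative diagonal (keeps the retraction well defined).
    signs = np.sign(np.sign(np.diag(R)) + 0.5)
    return Q * signs

def riemannian_grad(X, A, B):
    # Euclidean gradient of f(X) = tr(X' A X B), assuming A and B are symmetric.
    G = 2 * A @ X @ B
    # Project onto the tangent space of the Stiefel manifold at X.
    sym = (X.T @ G + G.T @ X) / 2
    return G - X @ sym

def qr_gradient_descent(A, B, X0, step=1e-2, iters=2000):
    # Fixed-step gradient descent with QR retraction (illustrative; the paper uses a line search).
    X = X0
    for _ in range(iters):
        X = qr_retraction(X - step * riemannian_grad(X, A, B))
    return X

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, k = 50, 3
    M = rng.standard_normal((n, n))
    A = -(M @ M.T) / n    # PCA-type instance: A is the negative of a covariance matrix
    B = np.eye(k)         # B = I recovers the eigenvalue / PCA special case
    X0 = qr_retraction(rng.standard_normal((n, k)))
    X = qr_gradient_descent(A, B, X0)
    print("objective tr(X' A X B):", np.trace(X.T @ A @ X @ B))

The update X_{k+1} = qf(X_k - alpha * grad f(X_k)), where qf denotes the Q factor of a QR factorization, is the "projected gradient" reading mentioned in Review #1: a Euclidean gradient step followed by a map back to the feasible set.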