We thank all the reviewers for their careful reading and insightful comments. We are particularly glad to learn that, despite the technical nature of the paper, all reviewers found it well-written and largely easy to follow. Below are our responses to the concerns and questions raised in the reviews.

To Reviewer_1: Thanks for your positive comments.

To Reviewer_2:

1. We can write D_rho(X) = grad F(X) - 2(rho - 1/4) X(G^T X - X^T G), where G = nabla F(X) denotes the Euclidean gradient of F. The term X(G^T X - X^T G) acts like a momentum on the tangent space, and 2(rho - 1/4) is the relative scaling. Moreover, D_1/4 corresponds to the projected gradient under the Euclidean metric, while D_1/2 corresponds to the gradient under the canonical metric (recall that a Riemannian manifold can be endowed with different metrics; see also (Wen & Yin, 2013)). Numerical results in both (Jiang & Dai, 2015) and our work show that rho = 1/4 performs better than rho = 1/2 in many cases. To make our analysis more general, we focus on D_rho(X) rather than grad F(X).
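To make the two special cases concrete, here is a minimal numpy sketch; it is our own illustration rather than code from the paper, and it instantiates the objective as F(X) = tr(AXBX^T), for which G = nabla F(X) = 2AXB:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 3

# Illustrative choice: symmetric A and B for the model objective
# F(X) = tr(A X B X^T), whose Euclidean gradient is G = 2 A X B.
A = rng.standard_normal((m, m))
A = (A + A.T) / 2
B = np.diag(np.arange(1.0, n + 1))

# A point on the Stiefel manifold St(m, n) via a QR factorization.
X, _ = np.linalg.qr(rng.standard_normal((m, n)))
G = 2 * A @ X @ B

def sym(M):
    return 0.5 * (M + M.T)

def D(X, G, rho):
    # Riemannian gradient under the Euclidean metric
    # (projection of G onto the tangent space at X).
    grad = G - X @ sym(X.T @ G)
    # Skew-symmetric momentum-like term on the tangent space,
    # scaled by 2(rho - 1/4).
    return grad - 2.0 * (rho - 0.25) * X @ (G.T @ X - X.T @ G)

# rho = 1/4 recovers the projected gradient under the Euclidean metric ...
print(np.allclose(D(X, G, 0.25), G - X @ sym(X.T @ G)))  # True
# ... while rho = 1/2 recovers the gradient under the canonical metric.
print(np.allclose(D(X, G, 0.5), G - X @ G.T @ X))        # True
```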
2. The question does lead to some very intriguing issues. Although it is not easy to generalize our techniques to higher-order polynomials (as they rely heavily on the quadratic structure of the problem), one may still be able to reach the same conclusions (i.e., a Lojasiewicz inequality and linear convergence) using other techniques. In a preliminary work, we show that this is the case for a class of quartic optimization problems over the unit sphere, which arises in the computation of ground states of Bose-Einstein condensates. It would be interesting to see whether the tensor problem is also amenable to such an analysis.

3. Thanks for the suggestion. We will move some proofs to the appendix and use the space to explain the connection in more detail.

To Reviewer_3:

1. Uniformity of eta in Prop. 2: The current proof is indeed incomplete. To get a uniform eta, we simply use the compactness of St(m,n) to show that ||nabla F(X)^T X||_F = 2||B X^T A X||_F <= 2nab, where a and b are the largest eigenvalues of A and B, respectively.

2. Proof of Thm. 1: Thanks for pointing out the gap. It can be fixed by using Prop. 3, which allows us to partition the critical point set Xcal into disjoint subsets, each pair of which is at distance at least 1 from the others (lines 550-558). Moreover, the objective value is constant over each subset. Thus, if X in St(m,n) and X_1, X_2 in Xcal are such that ||X - X_i||_F <= delta with delta < 1/2, then ||X_1 - X_2||_F <= 2*delta < 1 by the triangle inequality, so X_1 and X_2 belong to the same subset and F(X_1) = F(X_2). Hence, |F(X) - F(X_1)| = |F(X) - F(X_2)|, which, together with the argument in lines 443-450, implies that Thm. 1 holds as stated. With the above fixes, the claim in line 382 remains valid.

3. Retraction-dependent rates: This is because the rate at which the limit in Eq. (2) tends to 0 depends on the retraction, and this rate affects the choice of the step size and hence the convergence rate. Such dependence arises in the analysis framework developed in (Schneider & Uschmajew, 2015), which we are unable to elaborate on due to the page limit. However, following your advice, we will move some proofs to the appendix and use the space to explain this in more detail.

4. Generality of proof techniques: Since non-convex problems have more diverse structural properties than convex ones, a one-hammer-for-many approach could lead to rather weak results. In fact, almost all recent papers on non-convex optimization tackle very specific problems (e.g., phase retrieval, matrix completion, etc.) and employ techniques that cannot be easily generalized. In comparison, the problem studied in our paper is already quite general, as it includes various formulations as special cases; see also the comment by Reviewer_1.

To Reviewer_4:

1. Applications where B is not I: In fact, even for PCA, the use of a non-identity B helps remove the rotational ambiguity of the solution. For instance, if we take B = Diag(1, 2, ..., n), then the columns of the optimal X^* will correspond to the eigenvectors of A, in increasing order of the associated eigenvalues (see Prop. 3). By contrast, if B = I and X^* is an optimal solution, then so is X^*Q for any orthogonal Q; i.e., there is rotational ambiguity (see the numerical sketch at the end of this response). Another application is to find invariant subspaces by solving the matrix equation AY = YB (Dongarra, Moler & Wilkinson, 1983). The solutions are precisely the critical points of our more general problem. Lastly, the cost function tr(AXBX^T) has been used in canonical correlation analysis; see (Yger, Berar, Gasso & Rakotomamonjy, 2012). Thus, the problem studied in our paper does have machine learning applications beyond PCA. In fact, we did mention two of the above applications and some others in the introduction.

2. Asymptotic rates: As the reviewer pointed out, not even asymptotic rates were available prior to our work. This speaks to the difficulty of the problem and to our technical contribution. That said, we agree that obtaining non-asymptotic rates is important and is a natural next step in our research.

3. Retraction-dependent rates: Kindly refer to point 3 of our response to Reviewer_3.
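The sketch referred to in point 1 of our response to Reviewer_4 follows; it is our own minimal numpy illustration (not code from the paper) of how a non-identity B removes the rotational ambiguity in the cost tr(AXBX^T):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 6, 2

# Illustrative choice: a symmetric A for the cost F(X) = tr(A X B X^T).
A = rng.standard_normal((m, m))
A = (A + A.T) / 2

X, _ = np.linalg.qr(rng.standard_normal((m, n)))  # a point on St(m, n)
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))  # an arbitrary orthogonal Q

def F(X, B):
    return np.trace(A @ X @ B @ X.T)

# B = I: the cost cannot distinguish X from XQ, since
# tr(A X Q Q^T X^T) = tr(A X X^T) -- rotational ambiguity.
print(np.isclose(F(X, np.eye(n)), F(X @ Q, np.eye(n))))  # True
# B = Diag(1, 2): the two values differ for a generic Q,
# so the ambiguity is removed.
print(np.isclose(F(X, np.diag([1.0, 2.0])),
                 F(X @ Q, np.diag([1.0, 2.0]))))          # False (generically)
```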