We really appreciate the reviewers' efforts and will address their concerns in the revision to make the paper clearer. Below are our responses to the major concerns.

To Reviewer 3: Our work is quite different from that of Shalit & Chechik (2014): 1) we focus on the general Stiefel manifold, while their work focuses on manifolds of (square) orthogonal matrices, which are a special case of the Stiefel manifold; 2) their method, based on Givens rotations, is different from ours; 3) they only proved expected convergence to local solutions, while our method enjoys proven convergence to the global solution; 4) their method is generally not suitable for our problem (except when q=n, which is uncommon in our setting). We will add a discussion of this work. We will add more details on the tensor product, the derivation of C, and other derivations in the appendix. \beta_t=E[\|A_t\|^2\|Y_t\|^2] is bounded: since \|Y_t\|=1 and the data blocks A_t are obviously bounded, \beta_t=E[\|A_t\|^2] is bounded by a constant, while \alpha_t=c/t (Corollary 4.4.1) tends to zero as t goes to infinity, so \alpha_t\beta_t also tends to zero. We will add further explanation of \beta_t. We have confirmed that the signs of the off-diagonal elements (Lines 327-329) are correct. Random projection is quite different from our stochastic method: it needs a full scan over the whole input data each time the data are manipulated, and the model cannot be obtained without such full scans, which is very slow and impractical on large data. We will add comparisons with this method, and will also cite and discuss the mentioned papers.

To Reviewer 4: Our aim is not to propose complex theory, e.g., a new Riemannian geometry. Instead, we explore stochastic algorithms with non-trivial theoretical guarantees in the Riemannian setting. The stochastic scheme greatly reduces the per-iteration complexity, making the method more practical. There are currently only a few studies of our problem from this perspective. To the best of our knowledge, we are the first to: 1) generalize SCD from the Euclidean setting to the (general) Stiefel manifold; 2) explore the combination of SGD and SCD in the Riemannian setting; 3) prove expected convergence at an O(1/T) rate to the global solution with respect to either the problem or the Riemannian setting, whereas currently there are only results on convergence to local solutions in either case; 4) apply importance sampling in a nonconvex setting to improve the convergence rate. B is a scalar when we sample one column of X at a time, so in this case the matrix inversion is avoided. We believe that our experiments have demonstrated the efficiency and effectiveness of our algorithm. We will add comparisons with the randomized method, and follow the suggestions about evaluation measures and the upper bound on the spectral norm. For the middle column of Figure 1, this phenomenon is to be expected, since it is quite hard for stochastic algorithms to attain the optimal solution exactly. In particular, if the set of optimal solutions has measure zero in the search space, the probability that a stochastic algorithm attains the optimal solution exactly is zero; the iterates therefore wander in a region close to the optimal solution, so nearly optimal or sub-optimal solutions are often obtained. In practice this is not a problem as long as the iterates are close to the optimal solution. Yes, (8) (line 489) is the inequality in Lemma 4.2. We will amend it.

To Reviewer 5: We will clarify the misleading phrases. According to https://en.wikipedia.org/wiki/LOBPCG, LOBPCG seems to only find the top eigenvector, while ours aims at the top k (k>1 generally) eigenvectors.
We randomly initialize X^0, which does not affect convergence in theory, provided that it makes cos(X^t, V)>1/2 at some step t (see supplementary material, line 145). Theoretically this condition is non-trivial, but empirically random initialization works well, as we observed. We will make this clearer. The matrix inversion in RG-EIGS is required due to the retraction it uses. The exponential map is a retraction; however, a retraction generally is not an exponential map but a first-order approximation of it that enables efficient computation (see the sketch of a QR-based retraction at the end of this response). Yes, P_X is defined after S(\xi) (lines 186-187). Theoretically there are no restrictions on the block size as long as the data block fits into memory; in practice, on large data, it should be made much smaller than the size of the whole data. We have not yet investigated empirically whether there is a systematic way to choose the block size, but it is worth doing and we will consider it in an extension of this work. We are sorry about the confusion: Equation (8) refers to the inequality in Corollary 4.4.2. We will amend it. Yes, periodically orthogonalizing the columns of X is costly when q is also large. Regarding the power method: for example, when the two dominant eigenvalues of the given matrix have the same absolute value but opposite signs, it fails to converge (see the sketch at the end of this response). The method proposed in (Garber et al., 2015) only finds the top eigenvector. For Section 7, we will follow the suggestions in our revision.
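
As a side note on the retraction point above, here is a minimal, purely illustrative sketch (assuming a Python/NumPy environment; the sizes n=6, q=3 and the random seed are arbitrary, and this QR-based retraction is only a standard example, not necessarily the one used in RG-EIGS or in our algorithm). It shows a retraction on the Stiefel manifold that needs only a QR factorization yet agrees with the exponential map to first order:

    import numpy as np

    def qr_retraction(X, xi):
        # QR-based retraction on the Stiefel manifold St(n, q): return the Q
        # factor of X + xi, with column signs chosen so that the R factor has
        # a positive diagonal (this makes the map well defined).
        Q, R = np.linalg.qr(X + xi)
        return Q * np.sign(np.diag(R))

    rng = np.random.default_rng(0)
    n, q = 6, 3
    X, _ = np.linalg.qr(rng.standard_normal((n, q)))   # a point on St(n, q)

    # A tangent vector at X: X @ Omega + (I - X X^T) K, with Omega skew-symmetric.
    W = rng.standard_normal((q, q))
    xi = X @ (W - W.T) + (np.eye(n) - X @ X.T) @ rng.standard_normal((n, q))

    # First-order agreement: R_X(t*xi) = X + t*xi + O(t^2) for small t,
    # which is all a retraction requires; the error shrinks roughly like t^2.
    for t in (1e-1, 1e-2, 1e-3):
        print(t, np.linalg.norm(qr_retraction(X, t * xi) - (X + t * xi)))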
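
On the power-method remark above, a minimal, purely illustrative sketch (again assuming Python/NumPy; the 3x3 diagonal test matrix is hypothetical) showing the failure to converge when the two dominant eigenvalues have equal magnitude and opposite signs:

    import numpy as np

    # Symmetric matrix whose two dominant eigenvalues are +1 and -1
    # (same absolute value, opposite signs); the remaining eigenvalue is 0.5.
    A = np.diag([1.0, -1.0, 0.5])

    rng = np.random.default_rng(0)
    x = rng.standard_normal(3)
    x /= np.linalg.norm(x)

    for t in range(200):
        x_prev = x
        x = A @ x
        x /= np.linalg.norm(x)

    # The component along the third eigenvector dies out, but the iterate keeps
    # flipping between two directions, so ||x_t - x_{t-1}|| stays bounded away
    # from zero and the iteration never settles on an eigenvector.
    print(np.linalg.norm(x - x_prev))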