We thank the reviewers for their valuable comments.

Concern 1: There is no advantage to the Procrustes Flow (PF) algorithm over projected gradient descent (PGD).

This is a fair point, and we should have clarified it more in the paper. PGD works best in the regime where storing an n x n matrix in memory is feasible. In that case PGD works well because one can evaluate the gradient *once* per iteration and cache it; the Lanczos method for computing the rank-r projection then requires only dense matrix-vector products, which are reasonably fast since the gradient is already stored in memory. When this does not hold, however, the per-iteration time of PGD far exceeds that of gradient descent: without the gradient cached in memory, every Lanczos iteration must recompute the gradient. Hence the cost of PGD is roughly N times the cost of GD, where N is the average number of Lanczos iterations needed per PGD iteration. The exact value of N depends on the matrices at hand, but in the worst case it can be significantly greater than one. The natural question then is when storing n x n matrices becomes infeasible, and this happens very quickly: if n = 100,000 and r = 100 (a very reasonable size), then storing the n x r factor in double precision requires 80 MB, but the full n x n matrix requires 80 GB.

Furthermore, we believe that the factorized gradient scheme is fundamentally more versatile in a variety of settings: 1) Parallelization: it is much easier to distribute the factorized gradient scheme. 2) Other constraints on the factors: in general it is difficult (in some cases NP-hard) to impose additional constraints on the factors within a PGD framework. Finally, Section 6 of [Zheng and Lafferty 2015] and Section G of [Bhojanapalli et al. 2015] contain experimental results comparing a PF-style gradient descent algorithm against PGD (a.k.a. SVP). Both sets of simulations suggest that PGD can be slower due to the increased cost of each iteration.

Concern 2: The 1/m step size during the initialization phase is too small in practice.

Thank you very much for pointing this out. You are indeed correct that 1/m is too small. The step size in the PGD initialization phase should be 1 instead of 1/m; that is, \alpha_\tau in Theorems 3.2 and 3.3 should be set to 1, not 1/m. Likewise, there should not be any 1/m in the statement of Lemma 5.2. All theorems and lemmas are correct modulo this change. The reason the [Oymak et al. 2015] result requires 1/m is a different scaling of their measurement matrix: our measurement model is equal to the measurement model of [Oymak et al. 2015] divided by sqrt(m), which rescales the gradient by a factor of 1/m, so a step size of 1 under our scaling corresponds to a step size of 1/m under theirs.

Concern 3: What is the difference between the PF algorithm and the algorithm presented in the [Zheng and Lafferty 2015] paper?

There are a few differences between these algorithms (in the symmetric PSD case). In [Zheng and Lafferty 2015], one step of PGD is run to initialize the algorithm. In PF, a number of PGD iterations logarithmic in the condition number kappa is run, to guarantee that the initialization lands inside the region of convergence; in practice, as a rule of thumb, this involves no more than roughly 10 iterations. After initialization, the main difference between the algorithms is the step size: our framework allows a constant step size, whereas [Zheng and Lafferty 2015] require a much smaller step size of order 1/(nr). As a result we obtain a significantly faster convergence rate (see lines 517-523 of the paper for a detailed discussion of this point). A minimal sketch of the overall procedure appears below.
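To make Concerns 1-3 concrete, here is a minimal NumPy sketch of the symmetric PSD variant of the procedure: a few PGD (SVP) iterations with the corrected step size of 1 for initialization, followed by constant-step gradient descent on the n x r factor. The function name, the dense (m, n, n) storage of the sensing matrices, and the iteration counts and step-size constant are illustrative choices made to keep the sketch self-contained, not the tuned quantities from the paper.

```python
import numpy as np

def procrustes_flow(A_ops, y, r, T_init=10, T=500, step=0.5):
    """Sketch of Procrustes Flow for symmetric PSD matrix sensing.

    A_ops : (m, n, n) array of symmetric sensing matrices A_i
    y     : (m,) measurements, y_i = <A_i, X_star> for rank-r PSD X_star
    """
    n = A_ops.shape[1]

    def A(X):          # forward operator: X -> (<A_i, X>)_i
        return np.einsum('ijk,jk->i', A_ops, X)

    def A_adj(v):      # adjoint operator: v -> sum_i v_i A_i
        return np.einsum('i,ijk->jk', v, A_ops)

    def top_r_psd(X):  # projection onto rank-r PSD matrices via eigendecomposition
        w, V = np.linalg.eigh(X)
        idx = np.argsort(w)[::-1][:r]
        w_r = np.clip(w[idx], 0.0, None)
        return (V[:, idx] * w_r) @ V[:, idx].T

    # Initialization: a few PGD (SVP) iterations with step size 1,
    # the corrected value (not 1/m).
    X = np.zeros((n, n))
    for _ in range(T_init):
        X = top_r_psd(X - A_adj(A(X) - y))

    # Factor the initializer: X ~ U U^T with U of size n x r.
    w, V = np.linalg.eigh(X)
    idx = np.argsort(w)[::-1][:r]
    U = V[:, idx] * np.sqrt(np.clip(w[idx], 0.0, None))

    # Main phase: gradient descent on the factor U with a constant step
    # size; the scaling by ||U_0||^2 is one common way to make the step
    # dimensionless (an illustrative choice here).
    mu = step / np.linalg.norm(U, 2) ** 2
    for _ in range(T):
        U = U - mu * A_adj(A(U @ U.T) - y) @ U
    return U
```

At the scale discussed in Concern 1 (large n, modest r), one would of course replace the dense storage of the A_i with an implicit operator; the factorized iteration then keeps an O(nr) memory footprint, which is precisely the regime where PGD's n x n gradient cache becomes infeasible.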
Concern 4: Comparisons with the [Zhao et al. 2015] paper.

The [Zhao et al. 2015] paper appeared on arXiv 5 months after ours (we cannot point to the reference for obvious reasons). Regardless, there are still major differences between that paper and ours. Theorem 1 of [Zhao et al. 2015] states that recovery is possible if the r-RIP constant satisfies delta_r <= O(1/r), whereas our main results only require the r-RIP constant to be bounded by a fixed constant, i.e. delta_r <= O(1). This means that the number of measurements (sample complexity) we obtain is sharper. For example, for Gaussian sensing ensembles our sample complexity is m = Omega(nr), whereas [Zhao et al. 2015] requires a sample complexity of m = Omega(n r^3); the back-of-the-envelope calculation at the end of this response illustrates the size of this gap. Furthermore, they assume that the condition number of the unknown matrix is a constant, whereas we do not require such an assumption.

Concern 5: Lack of specific applications and idealized sensing model.

A good example is accelerated magnetic resonance imaging using low-rank constraints on images or videos. The measurement model in these cases consists of random Fourier measurements, which obey the matrix RIP by a recent result of [Oymak et al. 2015]. That being said, we completely agree that the RIP assumption is far from ideal. We view this work as a first step and hope to prove similar results under much less restrictive assumptions in the future.
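For concreteness, here is the sample-complexity gap from Concern 4 evaluated at the dimensions used in Concern 1 (n = 100,000, r = 100). Since the constants hidden in the Omega(.) notation are unspecified, only the ratio of the two scalings is meaningful:

```python
n, r = 100_000, 100

m_ours = n * r       # our requirement: m = Omega(n r)
m_zhao = n * r**3    # [Zhao et al. 2015]: m = Omega(n r^3)

print(f"ours ~ {m_ours:.1e}, [Zhao et al. 2015] ~ {m_zhao:.1e}, "
      f"ratio = r^2 = {m_zhao // m_ours:,}")
# ours ~ 1.0e+07, [Zhao et al. 2015] ~ 1.0e+11, ratio = r^2 = 10,000
```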