We thank the reviewers for the detailed and valuable comments.

Response to Reviewer_1

Thank you for your helpful comments. Your understanding is correct. In particular, the faster rate is attainable because we exploit the large singular values; if all the singular values are small, our results reduce to the existing results for nuclear norm regularization.

In addition, following your suggestion, we slightly changed the simulation settings and generated a new result figure: we set n = 4 r m log(m) with r = 3 and noise variance \sigma = 20/m, and varied the matrix size m from 40 to 200. We plot the estimation error against m for both our method (denoted SCAD) and the nuclear norm regularization based method (denoted Nuclear); the x-axis is m and the y-axis is the root mean square error. Indeed, we observe that the estimation error of our estimator is much smaller than that of the nuclear norm regularized one. However, given the rebuttal policy that "Please do not include any URLs in your response", we are not sure whether it is legitimate to include the URL here, so we have provided an anonymous link to the figure to the Area Chair and leave it to him/her to decide whether we may show you the figure. We apologize for any inconvenience this may cause. We will in any case include the figure in the camera-ready version. (A minimal sketch of this simulation setting is appended at the end of this response.)

On the real-world datasets (Section 4.2), we use cross validation to choose the value of \lambda. In detail, since \lambda = O(\sqrt{\log(M)/(nm)}), we set \lambda = C \sqrt{\log(M)/(nm)} and tune C by cross validation on the training dataset.

In general, it is possible to combine the scaled Lasso (or square-root Lasso) technique with a nonconvex penalty to estimate the noise variance while still achieving a faster rate of convergence. However, the resulting estimator would differ from the estimator proposed in this paper, so we plan to explore this extension in future work. Thank you for pointing it out.

For L_min, it can be chosen (as stated in Line 341) such that 0 < L_min <= L_{\tilde{L}_{n,\lambda}}, where L_{\tilde{L}_{n,\lambda}} = \rho(X) - \zeta_- (as given in Line 443). For the specific examples, we establish \rho in the proof of Corollary E.5 (Lines 2233-2253) for matrix completion and the proof of Corollary E.8 (Lines 2465-2463) for matrix sensing. Moreover, \zeta_- = 1/(b - 1) for SCAD (Line 438) and \zeta_- = 1/b for MCP (Line 441). As discussed in Lines 691 and 694, for SCAD we set b = 1 + 2/\kappa(X), while for MCP we set b = 2/\kappa(X); \kappa(X) is established for each example in Lemmas E.4 and E.6. The choices of L_min and b will be explained more explicitly in the camera-ready version. (A worked substitution of these formulas is appended at the end of this response.)

Our assumption on the nonconvex penalty (Assumption 3.3) is slightly stronger than the corresponding one in Loh & Wainwright (2013), since we further require the concave part q_\lambda() to satisfy a smoothness property (Assumption 3.3, condition ii), which is essential for achieving the faster rate of convergence. We note that Loh & Wainwright did not prove a faster rate of convergence for nonconvex penalties. Although this additional smoothness condition excludes the capped-L1 penalty, many widely used nonconvex penalties such as MCP and SCAD satisfy the conditions in Assumption 3.3, and our theoretical results apply to them directly.

Response to Reviewer_2

Thank you very much for your supportive comments. We will carefully check the notation and make sure every symbol is defined in the main content of the paper.
Regarding the verification of matrix sensing on real datasets, we have not yet found a good application of matrix sensing to a real problem of interest; we will keep searching for such applications.

Response to Reviewer_4

Thank you for your positive comments. We would like to emphasize that the major contribution of our paper is to establish the faster rate of convergence for low-rank matrix estimation using a nonconvex penalty. We extended the homotopy proximal gradient algorithm of Xiao & Zhang (2013) to optimize our proposed estimator; the faster rate of convergence is mainly due to the nonconvex penalty rather than to the homotopy proximal gradient scheme. (An illustrative sketch of a single proximal step is appended at the end of this response.) Regarding Theorem 3.5 (oracle property), Figure 1(c, f) demonstrates that, with high probability, the true rank is exactly recovered. The convergence rate is a special case of Theorem 3.4, whose verification is shown extensively on both the synthetic and the real-world datasets. We will highlight this part in the camera-ready version to make it clearer. Thank you for pointing this out.

To summarize, we thank all the reviewers again for the helpful and insightful comments. The issues raised by the reviewers are straightforward to address and will be incorporated into the camera-ready version.
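Simulation sketch (response to Reviewer_1). For concreteness, the following is a minimal sketch of the new simulation setting described above, assuming a uniformly sampled matrix completion model; fit_scad and fit_nuclear are hypothetical placeholders for the two estimators being compared and are not part of our implementation, and M, C refer to the quantities defined in the paper.

import numpy as np

def simulate_once(m, r=3, seed=0):
    # Setting from the response: n = 4 r m log(m), noise variance sigma = 20/m.
    rng = np.random.default_rng(seed)
    n = int(4 * r * m * np.log(m))
    sigma2 = 20.0 / m
    # Rank-r ground truth Theta* of size m x m.
    U = rng.standard_normal((m, r))
    V = rng.standard_normal((m, r))
    theta_star = U @ V.T
    # Uniformly sampled noisy entries (matrix completion model; illustrative assumption).
    rows = rng.integers(0, m, size=n)
    cols = rng.integers(0, m, size=n)
    y = theta_star[rows, cols] + np.sqrt(sigma2) * rng.standard_normal(n)
    return theta_star, rows, cols, y, n

def rmse(estimate, theta_star):
    # Root mean square error, the quantity on the y-axis of the figure.
    return np.sqrt(np.mean((estimate - theta_star) ** 2))

# For each m in 40, ..., 200:
#   theta_star, rows, cols, y, n = simulate_once(m)
#   lam = C * np.sqrt(np.log(M) / (n * m))   # regularization scaling; C tuned by cross validation
#   err_scad = rmse(fit_scad(rows, cols, y, m, lam), theta_star)        # our estimator (placeholder)
#   err_nuclear = rmse(fit_nuclear(rows, cols, y, m, lam), theta_star)  # nuclear norm baseline (placeholder)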
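Worked substitution for L_min and b (response to Reviewer_1). Plugging the stated values of b into the expressions for \zeta_- gives, for SCAD, \zeta_- = 1/(b - 1) = 1/(2/\kappa(X)) = \kappa(X)/2, and, for MCP, \zeta_- = 1/b = \kappa(X)/2. Hence in both cases L_min can be taken as any value with 0 < L_min <= L_{\tilde{L}_{n,\lambda}} = \rho(X) - \kappa(X)/2.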
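Illustrative sketch (response to Reviewer_4). The code below shows a single plain (non-homotopy) proximal gradient step for the observed-entry squared loss with the spectral SCAD penalty, using the standard SCAD thresholding rule applied to the singular values with a unit step size. It is only meant to illustrate why singular values below the threshold are set exactly to zero, which is the mechanism behind the exact rank recovery shown in Figure 1(c, f); it is not our actual implementation, and all names and parameter values are illustrative.

import numpy as np

def scad_threshold(z, lam, a=3.7):
    # Standard SCAD thresholding rule (for a > 2): the scalar proximal operator
    # of the SCAD penalty with unit step size.
    z = np.asarray(z, dtype=float)
    soft = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
    middle = ((a - 1) * z - np.sign(z) * a * lam) / (a - 2)
    return np.where(np.abs(z) <= 2 * lam, soft,
                    np.where(np.abs(z) <= a * lam, middle, z))

def prox_grad_step(theta, rows, cols, y, lam):
    # One proximal gradient step (unit step size) for the observed-entry squared loss,
    # followed by SCAD thresholding of the singular values; singular values below the
    # threshold are set exactly to zero, so the iterate has low rank.
    grad = np.zeros_like(theta)
    np.add.at(grad, (rows, cols), (theta[rows, cols] - y) / len(y))
    z = theta - grad
    U, s, Vt = np.linalg.svd(z, full_matrices=False)
    return (U * scad_threshold(s, lam)) @ Vt

The actual algorithm in the paper additionally uses the homotopy (continuation) strategy on \lambda following Xiao & Zhang (2013); the sketch above omits it.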