We thank the reviewers for their feedback, suggestions, and encouragement.

REVIEWER 1

We apologize for the lack of clarity that caused some misunderstandings; we will fix these.

– "...convexity is wrong" and "PD matrices cannot be considered as a cone": The set of PD matrices is a convex cone. Note that we use the standard definition of a convex cone [Thm. 2.6, Rockafellar (1970)], which defines cones via strictly positive multiples, i.e., K = \lambda K for \lambda > 0 (the reviewer has in mind pointed cones, i.e., those containing the origin and corresponding to \lambda ≥ 0). Consequently, the statement about convexity is correct.

– "9 to 10 is not obvious": The reviewer is right. We assumed S is PD by construction (7), so we did not restate it here; lack of strict positive definiteness in S can be handled via regularization. We will add a sentence to clarify this part.

– "Line 308 ... should be #_{-t}": No, the equation is correct as written, because
(A #_t B)^{-1} = (A^{1/2} (A^{-1/2} B A^{-1/2})^t A^{1/2})^{-1} = A^{-1/2} (A^{1/2} B^{-1} A^{1/2})^t A^{-1/2} = A^{-1} #_t B^{-1}.
(A short numerical check of this identity is included further below.)

– "416 to Eq. (18) not obvious": We will add a short proof to improve readability.

– "why 5-NN": Our choice of k in k-NN follows existing metric learning methods: ITML used k=4, LMNN used k=3, and FlatGeo used k=5.

– "additional expts": We are happy to add a new figure showing the average behavior of the classifiers for different values of t on different datasets. We observed that each of these plots has a convex-like shape.

– "low-rank; unfair comparison": We observed that the performance of LMNN and FlatGeo with low-rank metrics is not better than in their full-rank case. The main benefit of low rank is reduced training and test time, though possibly at the cost of accuracy. For example, in the FlatGeo paper the authors compare the classification accuracy of their method with other methods in the full-rank case, which suggests that they accept that the best performance of their method occurs at full rank.

– "future work": Thank you for sharing your thoughts regarding future work. We are working on an alternative solution inspired by an indirect use of GMML compatible with sparsity.

*COMMENTS / QUESTIONS*

* We apologize for the typos and will correct them immediately.
* LMNN and MMC do not require full-rank matrices, but their best-performing metrics are often full-rank. ITML uses the logdet loss, which is tantamount to requiring full rank. We agree this issue should be discussed in more depth.
* Thanks for the references; we will add citations to Bellet et al. and Sra.
* For matrix monotone functions, please see [Sec. 1.5, Bhatia (2007)] (Positive Definite Matrices). Specifically, it means monotone in the PD (Loewner) order: f is monotone increasing if f(A) ≥ f(B) whenever A ≥ B, i.e., whenever A − B is PSD.
* We will add a better explanation for (13) --> (14).
* Choice of A0: This depends on prior knowledge. Choices such as the identity or the inverse covariance are usually reasonable. As noted in the ITML paper, if the data are Gaussian, using the inverse sample covariance may be appropriate.
* Accuracy change with dimension: This concern arose in our experiments too; after trying several datasets, we found that this conclusion is probably not true.

REVIEWER 3

– "Insights": Thanks for the feedback regarding this part. We will add further insights on the proposed objective functions. Further, our remark about d_A(x,y) was made with test data in mind. We agree, though, that the buildup to (5) should be more detailed.
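The following is a minimal numerical sketch (in NumPy) checking the #_t inversion identity quoted above on random PD matrices; the helper names spd_power, geodesic, and random_spd are ours, introduced only for this illustration, and this is not code from the paper.

import numpy as np

def spd_power(A, t):
    """A^t for a symmetric positive definite matrix A, via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * w**t) @ V.T

def geodesic(A, B, t):
    """Matrix geometric mean A #_t B = A^{1/2} (A^{-1/2} B A^{-1/2})^t A^{1/2}."""
    A_half = spd_power(A, 0.5)
    A_nhalf = spd_power(A, -0.5)
    return A_half @ spd_power(A_nhalf @ B @ A_nhalf, t) @ A_half

def random_spd(d, rng):
    """Random symmetric positive definite matrix."""
    M = rng.standard_normal((d, d))
    return M @ M.T + d * np.eye(d)

rng = np.random.default_rng(0)
A, B = random_spd(5, rng), random_spd(5, rng)
t = 0.3

lhs = np.linalg.inv(geodesic(A, B, t))
rhs = geodesic(np.linalg.inv(A), np.linalg.inv(B), t)
print(np.max(np.abs(lhs - rhs)))  # on the order of 1e-13: (A #_t B)^{-1} = A^{-1} #_t B^{-1}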
– "normalization" Actually, simple normalization of the matrices S and D does not affect performance. We discuss this point in the first paragraph of Section 2.4. We will add more discussion to make this point clearer. Thanks for catching the typos. REVIEWER 4 – "Effects of sampling" The reviewer is right. Our way for choosing the number of constraints is suboptimal. We have searched for new formulas for the number of similarity and dissimilarity constraints, so that in addition to the number of classes we also take into account the dimensionality and the number of data-points within each class. But we have not yet discovered a satisfactory formula that works well in most situations. – "MNIST" We also wished to apply our method on much larger problems, but in this situation, we miss the ability to compare with other metric learning methods. We encountered this problem when working with the original MNIST dataset, and therefore were forced to use a smaller version of it. We downloaded it from www.cad.zju.edu.cn/home/dengcai/Data/MLData.html We will mention this fact about the dataset used explicitly.