Paper ID: 1107
Title: Geometric Mean Metric Learning

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

The authors propose a new algorithm for distance metric learning. The goal of learning is (roughly speaking) to discover a linear transformation that increases the distance between examples in different classes and decreases the distance between examples in the same class. The linear transformation is learned as a Mahalanobis distance, parameterized by a symmetric positive definite matrix A. The authors have the very clever insight that to increase distances under A, one can decrease distances under inv(A). This leads to a simple objective function and an elegant, closed-form solution based on geodesics in the space of positive definite matrices (I sketch my reading of this solution at the end of my review). The approach works well, and it's faster than previous approaches.

Clarity - Justification:

The paper is a model of clarity.

Significance - Justification:

The authors develop a faster, simpler algorithm for distance metric learning, which has many applications. The experimental results are sound, though I have nitpicks in the detailed comments. As the proposed method is orders of magnitude faster, it should be possible to show results on a much larger problem that none of the other methods can handle. (I think there was a missed opportunity here.) But the real significance of this paper is the novelty and elegance of the approach. There have been hundreds of extensions to LMNN, ITML, NCA, etc., but the approach in this paper is of a different kind. In addition, it leads to mathematical ideas that could conceivably have applications in other forms of learning.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

I would like to see more discussion (and perhaps results) on how the method behaves as a function of the number of constraints. If I understand correctly, the authors randomly sample these constraints (lines 476, 536). But this seems suboptimal for improving kNN classification; it seems much better to concentrate on increasing the distance between easily confusable examples in different classes, and on decreasing the distance between neighboring examples in the same class. These are the examples that will strongly affect the performance of kNN. Put another way, the authors seem to cripple their algorithm by purposefully weak choices of the sets S and D.

This may also lead to a more nuanced comparison with other algorithms. The strength of (say) LMNN is that it enforces all *triples* of dissimilarity constraints, and that is probably the more important reason why it is time-consuming. (LMNN requires an iterative approach because there is no closed-form solution, but in practice the number of iterations, and hence the computation time, is strongly driven by the number of dissimilarity constraints.)

The MNIST results in Figure 2 must be incorrect. First, error rates for kNN classification on MNIST are in the 1-2% range, whereas all algorithms here show 10-20% or more. Second, it is shown here that LMNN performs significantly worse than a Euclidean distance metric on this data set; if the authors find this, they really must explain why it contradicts all previously reported results. (Note: this is also the one large data set, if I understand correctly, where GMML is claimed to significantly outperform LMNN.) Finally, I presume that n=60000, not n=4000, for MNIST. Did the authors use the whole data set? Please clarify.
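For concreteness, my reading of the closed-form solution is that the learned metric is the geodesic midpoint (matrix geometric mean) of the inverse similarity scatter and the dissimilarity scatter. The sketch below is mine, not the authors' code; the variable names, the scipy-based matrix square roots, and the assumption that the similarity scatter is full rank are all my own.

    import numpy as np
    from scipy.linalg import sqrtm, inv

    def gmml_metric_sketch(X, similar_pairs, dissimilar_pairs):
        """Sketch: SPD matrix A at the geodesic midpoint of inv(S) and D,
        where S and D are the similarity/dissimilarity scatter matrices."""
        d = X.shape[1]
        S = np.zeros((d, d))
        D = np.zeros((d, d))
        for i, j in similar_pairs:        # pairs from the same class
            v = (X[i] - X[j])[:, None]
            S += v @ v.T
        for i, j in dissimilar_pairs:     # pairs from different classes
            v = (X[i] - X[j])[:, None]
            D += v @ v.T
        # Geodesic midpoint of inv(S) and D (matrix geometric mean):
        #   A = S^{-1/2} (S^{1/2} D S^{1/2})^{1/2} S^{-1/2}
        S_half = np.real(sqrtm(S))        # sqrtm may carry tiny imaginary noise
        S_half_inv = inv(S_half)
        A = S_half_inv @ np.real(sqrtm(S_half @ D @ S_half)) @ S_half_inv
        return A

If this reading is correct, the whole computation amounts to a handful of d x d matrix operations, which is consistent with the reported speed-ups.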
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

This paper proposes a novel technique for learning Euclidean distance metrics for classification. The proposed method is simple, but apparently effective and efficient. Overall, I quite enjoyed reading this paper.

Clarity - Justification:

The ideas are clearly and thoroughly expressed, and the text is easy to follow. Importantly, the proposed method is explained in clear terms and should be easy for an independent reader to re-implement. The experiments are explained well. My only suggestion would be to expand Section 2.1 (see below).

Significance - Justification:

The proposed method brings some novel and interesting insights to metric learning, and may substantially influence the design of future algorithms.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

Section 2.1 (insights) provides some intuition behind the choice of the objective function in (5), but still feels a bit vague. Since there is ample space left over for expanding this manuscript within the ICML guidelines, I'd recommend expanding this portion. Specifically, the claim "d_A(x,y) increases monotonically in A ... d_{A^{-1}}(x,y) decreases monotonically in A" does not relate directly to (5), since any pair (x,y) appears either under d_A or d_{A^{-1}} but not both. Similarly, the second portion, relating to the gradients of d_A and d_{A^{-1}}, also does not apply directly to (5). It seems like what is needed here is a direct link to minimizing same-class distances and maximizing different-class distances. This ultimately becomes a bit clearer once the pairwise distances are replaced by aggregated scatter matrices and similarity is determined by geodesic distance, but making this connection more precise up front could provide a more direct motivation for (5).

The weighted extension in Section 2.4 seems to be important for achieving good performance in practice. I wonder how much of this could be resolved by normalizing S and D. If the number of classes is large, then the dissimilarity constraints will vastly outnumber the similarity constraints, which may bias A. Some discussion or investigation of this point might be helpful, since it is not necessarily obvious how these factors play out under geodesic distance.

Minor corrections:
- Page 3, line 394: "is suffices" => "it suffices"
- Page 4, line 364: "we following" => "we follow"
- Page 4, line 372: "than on" => "that on"

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

This paper describes an interesting approach for learning a discriminative metric using concepts from Riemannian geometry. In practice, the authors propose to learn a relevant metric as the interpolation (on a geodesic) between the covariance matrices computed respectively on similar and dissimilar points. The authors then propose regularised versions of this approach as well as weighted versions (where a point other than the midpoint is chosen on the geodesic).

Clarity - Justification:

The authors have written a clear paper, and its structure makes it easy to read.

Significance - Justification:

The authors' basic idea is rather interesting: use a metric for points of the same class and the inverse of this metric for points belonging to different classes. This idea, when developed, produces a nice and simple algorithm.
However, the same metric is used everywhere in the space when classifying points, so the use of the inverse of the metric is rather surprising. The authors should give an explanation or some insight into this curious cost function.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

The paper proposes an interesting approach, but I have several concerns about it.

Major concerns:
- The part about the convexity of the cost in the proof of Theorem 3 is wrong. Indeed, the space of PD matrices cannot be considered a cone; it is strictly the interior of the cone, which makes the rest of the claim invalid. This is a very important point, as the boundary of the cone consists of rank-deficient positive semidefinite matrices.
- Going from Eq. 9 to Eq. 10 is not obvious. Moreover, is it true for any S, or should S be a PD matrix?
- Line 308: should the right-hand side of the equation not use #_{-t} (instead of #_t)?
- Going from the equation on line 416 to Eq. 18 is also not obvious.
- In the numerical experiments, why did the authors use 5-NN? It is more usual in the literature to use a simple 1-NN instead.
- Additional experiments (on toy datasets) showing the effect of the parameter lambda (and the choice of A_0) should be added. Moreover, a figure showing the effect of the parameter t on the accuracy should be plotted (see the short sketch at the end of my review for the weighted geodesic point I have in mind).
- In my opinion, it is unfair to compare a full-rank metric (as learned in this approach) to a metric learned by a method designed for the low-rank case. Moreover, not counting the parameter-selection time for the authors' approach makes the comparison unfair.
- As future work, learning a metric that is both sparse and PD seems, in my opinion, an ill-matched combination: the Riemannian structure may break as the sparsity increases.

Specific comments:
- There are several typos in the draft.
- Contrary to what is specified in Eq. 1, most metric learning algorithms use a positive semidefinite metric A (as specified in Eqs. 2 and 3). To be precise, most algorithms focus on low-rank metrics. This is a rather big difference compared to the authors' choice of a PD matrix, and it should be discussed in depth.
- When reviewing the literature on metric learning, I think the following paper should be cited: "A Survey on Metric Learning for Feature Vectors and Structured Data" by Aurelien Bellet et al.
- Regarding geodesic convexity, the work of Sra, "Geometric Optimization in Machine Learning", should be cited.
- What does "increases monotonically in A" mean on line 193? Since A is a matrix, is the cost monotonic with respect to each element of the matrix?
- The computation from Eq. 13 to Eq. 14 is not obvious; please give a reference (or the explanation in an appendix). As there are pages left in this paper, those details should have been added.

Questions:
- In the regularised version, what are the possible choices for A_0? Would it be relevant to use the covariance of the whole dataset as A_0?
- Did you observe any change in the accuracy of the authors' approach as the dimension of the dataset grows?
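To make my questions about the parameter t (and the #_t notation) concrete, the weighted geodesic point I have in mind is the standard weighted geometric mean of two SPD matrices. The sketch below is my own code, not the authors'; the function name and the scipy calls are my choices.

    import numpy as np
    from scipy.linalg import sqrtm, inv, fractional_matrix_power

    def weighted_geometric_mean(A, B, t):
        """A #_t B = A^{1/2} (A^{-1/2} B A^{-1/2})^t A^{1/2} for SPD A, B.
        t = 0 returns A, t = 1 returns B, and t = 1/2 is the midpoint."""
        A_half = np.real(sqrtm(A))
        A_half_inv = inv(A_half)
        inner = fractional_matrix_power(A_half_inv @ B @ A_half_inv, t)
        return np.real(A_half @ inner @ A_half)

    # Hypothetical usage for the accuracy-versus-t plot I am asking for,
    # with S and D the similarity/dissimilarity scatters:
    #   for t in (0.1, 0.25, 0.5, 0.75, 0.9):
    #       A_t = weighted_geometric_mean(inv(S), D, t)
    #       ... evaluate k-NN accuracy under A_t ...

Such a sweep over t (and over lambda for the regularised variant) would make the sensitivity of the method much easier to judge.

=====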