Paper ID: 1197
Title: Gromov-Wasserstein Averaging of Kernel and Distance Matrices

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

The paper presents a method for finding a barycenter of a set of similarity matrices, without the need for any registration of the underlying points. The method is applicable even when the dimensions of the similarity matrices differ. The main technical tool is an extension of the Wasserstein (transport) distance to the realm of matrices. The paper builds on recent work on fast computation of the Wasserstein distance and of Wasserstein barycenters. The paper includes experiments showing that the method finds reasonable barycenters for sets of point clouds represented as similarity matrices, as well as an application to a problem in computational chemistry.

Clarity - Justification:

The main point of the paper was easy to understand, as were the main technical components. The algorithm itself was also presented clearly. Below are some comments on a few specific issues that I found less clearly presented.

Significance - Justification:

The paper introduces a novel method for computing barycenters of similarity matrices. This in itself is a notable contribution to the existing literature. On the other hand, the paper does not show any conclusive or compelling advantages or uses of this method. The authors do not discuss the overall computational burden of the method, but it seems that each computation is (roughly) cubic in the number of data points per similarity matrix. This cubic-time computation is performed iteratively until convergence, and is compounded by the need to compute the optimal transport matrices T_s, though I suspect that computing the T_s is not the bottleneck. Finally, the experimental results are somewhat underwhelming. The authors use similarity matrices as a way to represent point clouds while retaining rotation invariance. However, the experiments do not show that this is a compelling motivation for using similarity matrices, and much more impressive results have been achieved using explicit Wasserstein barycenters; see, for example, [1]. The application to the problem of quantum chemistry was more compelling; however, the results were not especially strong compared to standard methods such as kernel ridge regression. Comparisons with the experiments of Dryden et al. (2009) [2] would have been welcome. Alternatively, I believe that applications in the field of domain adaptation with kernels could be interesting, where the similarity (kernel) matrices are given for different domains with different sample sizes.

[1] Solomon, Justin, et al. "Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains." ACM Transactions on Graphics (TOG) 34.4 (2015): 66.
[2] Dryden, Ian L., Alexey Koloydenko, and Diwei Zhou. "Non-Euclidean statistics for covariance matrices, with applications to diffusion tensor imaging." The Annals of Applied Statistics (2009): 1102-1123.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

1. Does the proposed method generalize some previously known barycenter method? Specifically, does it become simpler if all p_s and p are assumed to be uniform, and all N's are assumed to be equal?

2. Lines 256-257: I found the discussion of registration here confusing. I think this section would be clearer if the authors gave a concrete example of what is meant by registration in this context.

3. Equation (8): This iteration could be better motivated for people not familiar with recent optimal transport techniques. In particular, the use of the element-wise multiplication between the transport matrix T and an exponential of the gradient is not obvious. (To make the question concrete, a sketch of how I read this kind of update appears after the minor comments below.)

4. What are the typical runtimes of the experiments shown in the paper?

minor comments:

I recommend using \citet instead of \citep in many instances. For example, line 168: "(Sra, 2011) defines..." should read "Sra (2011) defines...".
line 67: "varying number of molecules" --> I believe it should be "varying number of atoms".
line 120: missing citation.
lines 144-145: missing word in this sentence.
lines 196-197: typo, "is" should be "as".
line 704: n \times n should probably be n_x \times n_x.
lines 704-705: There seems to be a typo in these lines.
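Regarding comment 3 above: my reading is that Equation (8) is a mirror-descent step in the entropy (KL) geometry, where the additive gradient step of ordinary gradient descent becomes a multiplicative one, followed by Sinkhorn scalings that restore the marginal constraints. A minimal NumPy sketch of such an update for the squared loss follows; the helper name, the coupling of step size with the regularization parameter, and the constants are my own assumptions and may differ from the paper's exact Equation (8).

```python
import numpy as np

def entropic_gw_step(T, C1, C2, p, q, eps, n_sinkhorn=50):
    """One entropic mirror-descent step for a GW-type objective.

    A minimal sketch for the squared loss, under my reading of the
    update; the paper's Equation (8) may differ in constants,
    step size, and stopping criteria.
    """
    # Gradient of the GW energy w.r.t. T, dropping terms that are
    # row- or column-constant once the marginals of T are fixed
    # (those are absorbed by the Sinkhorn scalings below): for the
    # squared loss, the tensor-matrix product reduces to this.
    grad = -2.0 * C1 @ T @ C2.T
    # Multiplicative update: T is multiplied elementwise by the
    # exponentiated (negative, scaled) gradient. This is a gradient
    # step in the KL/entropy geometry, which is why a product
    # appears instead of a subtraction.
    K = T * np.exp(-grad / eps)
    # KL-project back onto the transport polytope with Sinkhorn
    # scalings so the row/column marginals p and q are restored.
    a = np.ones_like(p)
    for _ in range(n_sinkhorn):
        b = q / (K.T @ a)
        a = p / (K @ b)
    return a[:, None] * K * b[None, :]
```

Here eps plays the double role of regularization strength and inverse step size; the paper may decouple these, but spelling out even this much in the text would answer my question.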
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

This seems like a reasonable paper, but it is outside my wheelhouse.

Clarity - Justification:

As an outsider, I was able to follow most of the derivations.

Significance - Justification:

I'm not in a position to judge.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

I will only ask: if GW-3nn is not directly competitive with alternative approaches (as in the results of Section 4.2), could it be combined with alternative methods (e.g., as input to a neural network) to achieve superior performance?

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

This paper proposes a method based on the Gromov-Wasserstein (GW) distance to align a batch of similarity matrices while computing their barycenter. It extends the GW distance to more general matrices, with entropy used as a regularizer. Based on the GW distance, it defines the barycenter of matrices of various sizes as the Fréchet mean. It proposes a relatively efficient algorithm, based on techniques such as factorizing the tensor-matrix product. Finally, it conducts some simple case studies to show the practical performance of the proposed methods.

Clarity - Justification:

The presentation is clear. It is not hard to get the gist of the concept and method, and the detailed description is also tractable to me. Algorithm 6 provides a good reference for implementation, but replication may require more detailed understanding and experience. The abstract is a bit confusing at first glance, because it is not clear what the similarity matrices are for. I think it would be better to mention that the similarity matrices represent data; otherwise, one may be confused about why a barycenter should be computed at all.

Significance - Justification:

The contribution of this paper is mainly theoretical. In particular, extending the GW distance, defining the barycenter of heterogeneous matrices, and solving the alignment problem all look like good contributions. The idea of aligning data of different dimensions to a reference point (the barycenter) is also interesting. However, due to the limits of my knowledge of this topic, I am not sure either how important the problem is or how significant this line of research is. Just to be open, I assume these theoretical contributions are all meaningful. The experiments provide an OK demonstration of the theory. Practically, the results are not strong.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

This paper looks more theoretical than practical. In this sense, its contributions look good (refer to my comments in "Significance - Justification"). However, for me it is a bit hard to judge: since the main related work was done several years ago, I am not sure about the significance of this line of research. As to the experiments, I have several questions.

1. For the point cloud experiments (i.e., on MNIST), there is no comparison with other methods. Is there any previous method that can do a similar job? More specifically for point cloud warping and registration, the iterative closest point (ICP) method could be a practical baseline: point clouds can be registered and put in correspondence by ICP given a reference; then, removing the reference while keeping the correspondences, a joint alignment method (like funneling, RASL, …) can be used for batch registration. (A minimal sketch of the kind of ICP baseline I have in mind appears at the end of this review.)

2. The visualization of the barycenters is reasonable, but not great. For example, in Figure 2, the barycenter of the bones (first row) does not look like a bone. The same happens for row 4. Some important semantic attributes are lost in the barycenter.

3. Can the proposed method handle more general transformations (not limited to 2D rotations)? Experiments on realistic 3D point clouds would be very helpful.

4. Why not show alignment results, i.e., the registered versions of the point clouds?

5. The results in Table 1 are also not promising compared to other types of methods. My concern is whether the proposed method can ultimately be useful in practice.

Since the practical contribution is limited, I may adjust my rating based on the authors' response and further discussion of the theoretical significance.
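For concreteness, here is a minimal NumPy sketch of the ICP baseline mentioned in question 1. This is hypothetical illustration code, not anything from the paper: it assumes point clouds given as (points x dimensions) arrays, plain nearest-neighbour correspondences, and a rigid fit via the standard Kabsch/Procrustes solution (no scaling, no outlier handling).

```python
import numpy as np

def icp(source, reference, n_iters=50):
    """Rigidly align `source` (m x d) to `reference` (n x d).

    Minimal nearest-neighbour ICP; a hypothetical baseline sketch.
    Returns the aligned points and the accumulated rigid transform.
    """
    src = source.copy()
    d = source.shape[1]
    R_total, t_total = np.eye(d), np.zeros(d)
    for _ in range(n_iters):
        # Correspondence step: match each source point to its
        # nearest reference point.
        d2 = ((src[:, None, :] - reference[None, :, :]) ** 2).sum(-1)
        matched = reference[d2.argmin(axis=1)]
        # Fit step: best rigid transform via the Kabsch/Procrustes
        # solution on the centred point sets.
        mu_s, mu_m = src.mean(0), matched.mean(0)
        U, _, Vt = np.linalg.svd((src - mu_s).T @ (matched - mu_m))
        D = np.eye(d)
        D[-1, -1] = np.sign(np.linalg.det(Vt.T @ U.T))  # no reflections
        R = Vt.T @ D @ U.T
        t = mu_m - R @ mu_s
        src = src @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t
    return src, R_total, t_total
```

Running this against a fixed reference, then handing the correspondences to a joint alignment method, is the kind of two-stage baseline I would want to see the proposed method compared against.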
===== Review #4 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

The contribution is two-fold: the definition of a (regularized) Gromov-Hausdorff (Wasserstein?) distance between matrices, and the introduction of a 1-median via the Fréchet mean. Applications of these ideas are provided.

Clarity - Justification:

It's a pretty readable paper.

Significance - Justification:

The contribution is new. But there's not that much of it.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

Given that I'm familiar with the basic G-H ideas (and Fréchet means), I didn't find anything too surprising in this work. One bit that went unexplained, though, was the entropic regularization. It seems to come from Cuturi's work on regularizing the EMD by looking for "almost independent" marginals in the transport function. But in that work there was a strong argument made for the utility of this regularizer; such an argument is missing here.
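For reference, the objective I believe is being borrowed, written from memory in my own notation (a sketch, not the paper's formulation):

```latex
% Entropy-regularized OT (Cuturi, 2013), in my own notation:
% \varepsilon > 0 smooths the problem and biases the plan T toward
% the independent coupling p q^T.
\min_{T \in \Pi(p,q)} \; \langle T, C \rangle \;-\; \varepsilon H(T),
\qquad H(T) = -\sum_{i,j} T_{ij} \left( \log T_{ij} - 1 \right),
\qquad \Pi(p,q) = \{\, T \ge 0 \;:\; T \mathbf{1} = p,\; T^{\top} \mathbf{1} = q \,\}.
```

In Cuturi's setting the case for this regularizer was largely computational: the entropic term makes the problem strictly convex and solvable by fast Sinkhorn matrix scalings. It would help to state which part of that argument carries over to the GW setting here.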
Overall, this seems like a reasonable way to compare matrices, especially without any additional structure such as being s.p.d., etc. One challenge in this setting arises when the matrices are of different rank (that is the formulation that causes trouble for things like the Grassmannian-based metrics, which you should cite; see work by Edelman). The approach given here neatly sidesteps this (because it essentially doesn't care). But I wonder how it would behave for matrices of different rank: would the results be meaningful?

The experiments are a little tricky, though. The example with MNIST works, but then so do many things. The example with the shapes is good, but in both these cases there are other methods that also work reasonably well (for example, the work on Fréchet means on manifolds by Fletcher/Joshi/Venkatasubramanian from CVPR a few years ago). These methods are much more specific and require more structure, but that's the point: the effectiveness of this paper would come from its applicability in very general settings. The clustering examples are good. I'm not sure, though, how to interpret the results on the molecular data: on the surface, the results aren't that great, but it sounds like the authors are arguing that by using this metric they can use a much simpler classifier. That is plausible.

=====