Paper ID: 478 Title: Representational Similarity Learning with Application to Brain Networks

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): This manuscript formalizes a data-processing method used in neuroscience, representational similarity analysis (RSA), as a multi-output regression problem. It introduces a weighted group-sparsity regression formulation. Finally, it extends RSA from its definition on a local neighborhood of the brain images to a full-brain approach. The methods are validated on a rich simulation of brain activity and on real brain data.

Clarity - Justification: I must say that I did not fully understand, from my reading of the beginning of Section 4, whether NRSA differs from RSA solely by relaxing the local neighborhood to a large set of voxels.

Significance - Justification: The formalization of RSA into a well-posed model is a welcome addition to the field. The extension of OWL to group sparsity is most likely useful, particularly in these settings, yet other approaches would probably also solve the problem well. I find the validation with a rich simulation based on a model of function original and interesting.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): The graphs in Figure 4 are slightly misleading: only the left part of each graph is really relevant to neuroscience applications, unlike ROC curves for prediction. I would suggest investigating either truncated ROCs or precision-recall curves (which can be related to FDR). There is probably more work to be done to extract visualizations and messages from the model that are genuinely useful for neuroscience. Indeed, figures like Figure 5 do not map directly to the standard mental pictures. In a sense, this is a benefit of the ugly trick that standard RSA uses by restricting itself to a local neighborhood.
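To make the suggestion above concrete: precision-recall points can be read directly off a ranking of voxel scores, so the left (high-specificity) end of the curve is emphasized rather than the full ROC. A minimal sketch, with made-up `scores` and `truth` arrays that do not come from the paper:

```python
# Hypothetical illustration of the reviewer's suggestion: sweep a threshold
# down a ranked list of voxel scores and record (precision, recall) points.
def precision_recall(scores, truth):
    # Visit items in decreasing score order, as if lowering a threshold.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    total_pos = sum(truth)
    points = []
    for i in order:
        if truth[i]:
            tp += 1
        else:
            fp += 1
        points.append((tp / (tp + fp), tp / total_pos))  # (precision, recall)
    return points

pts = precision_recall([0.9, 0.8, 0.7, 0.1], [1, 0, 1, 0])
```

Precision, unlike the false-positive rate on an ROC axis, degrades visibly when the selected set picks up spurious voxels, which is why it relates to FDR (FDR = 1 - precision at a given threshold).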
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): The authors develop, validate and utilize an improved algorithm (both a representation of the GrOWL norm and its proximity operator) for sparse multi-task regression. GrOWL extends Ordered Weighted L1 regression (OWL) so that it can identify representations from multiple groups of predictors. The main feature of this technique is that it promotes a sparse representation even in the face of strongly correlated predictors. Furthermore, as in OWL, the GrOWL norm encourages clustering of weights. The model is first validated on an abstraction of predicting word sounds from spelling and semantics, as represented by a deep neural network. Second, GrOWL is applied to fMRI data from a human visual discrimination task. GrOWL outperforms group lasso on the simulated task and performs comparably on the real data. Notably, GrOWL seems to identify a larger number of non-zero predictive weights in each task, purportedly as a result of being able to assess relationships between intra-correlated inputs and outputs.

Clarity - Justification: Overall, the paper is well organized and progresses in an easy-to-follow manner. Two main factors hinder the paper's clarity: a number of terms related to the fMRI experimental design are used without explanation, and parameters of the design are described without motivation or context. Figure 2 is a good representative case study of the difference in results between GrOWL and other approaches. I would have benefitted from a review of the key distinguishing features between multi-task and single-task regression, the meaning and significance of proximal algorithms, and the definition of and motivation for the Chamfer distance as a surrogate for perceptual similarity.

Significance - Justification: This technique seems to overcome some fundamental limitations in how functional networks are identified from fMRI data.
My only concern here is that the inability to identify clusters of functionally connected activity seems such a fundamental limitation that I find it difficult to believe that methods do not already exist for this purpose. Alternatively, if such techniques do exist but GrOWL provides significant advantages in some aspect other than identifying non-local clusters, this was unclear from the text. The significance could be made much stronger by appealing to the specific use case of identifying potentially non-local functional networks from fMRI data, for instance to build our understanding of how subsystems within the brain interact. The authors mention "other applications" but do not specify them. Identifying some more concrete applications would strengthen the case for the general usefulness of this technique. The example motivation from a chemistry-education perspective seems a little unrealistic as an actual use of this algorithm, but it is perhaps a useful demonstration of the components and purpose of the GrOWL algorithm.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): Overall, the paper is interesting and demonstrates a useful step forward in methodology for clustered, sparse regression. Given the story outlined in the introduction, it is unclear why the Chamfer distance was used to quantify a subject's perceived similarity between objects rather than a measure that required the individual to report their perception. What are the implications of GrOWL-II identifying more voxels? It would be useful to compare the connectivity estimated by GrOWL-II to rough anatomical predictions, to contextualize whether the algorithm is truly uncovering connections other methods cannot find, or whether it is likely to generate false positives in this setting.
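For readers unfamiliar with the penalty discussed in this review: a GrOWL-style norm, as commonly described, applies a non-increasing weight sequence to the sorted row norms of the coefficient matrix. A minimal sketch under that assumption (the matrix `B` and weights `w` below are illustrative, not from the paper, and the paper's exact convention may differ):

```python
import math

def growl_norm(B, w):
    # GrOWL-style penalty: l2 norms of the rows of B, sorted in
    # decreasing order, weighted by a non-increasing sequence w.
    row_norms = sorted((math.sqrt(sum(x * x for x in row)) for row in B),
                       reverse=True)
    return sum(wi * ni for wi, ni in zip(w, row_norms))

# Illustrative coefficient matrix: three predictors, two tasks.
B = [[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]]
val = growl_norm(B, [2.0, 1.0, 1.0])       # decaying weights
flat = growl_norm(B, [1.0, 1.0, 1.0])      # equal weights
```

With all weights equal, the penalty reduces to a group-lasso-style sum of row norms; strictly decaying weights charge extra for the largest rows, which is what encourages ties (clustering) among the coefficient rows of strongly correlated predictors.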
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): The authors present an algorithm for representational similarity learning and validate it on simulated phonetic data and real fMRI data. Their primary contribution is a regularisation term in the regression cost function that can deal with correlated variables, which is relevant in the context of fMRI analysis.

Clarity - Justification: I found the manuscript very well written and mostly easy to follow. My only criticism in terms of clarity is the quality of the figures, some of which are hard to read. The authors should replace their bitmap figures with vector graphics and increase the font sizes.

Significance - Justification: In my opinion, the authors make two important contributions. First, they present an approach to learning representational similarities, in contrast to performing an exhaustive search over all ROIs, and, second, they show how to include correlated variables in the analysis without degrading performance, which is important for interpreting the obtained fMRI maps.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): I think this is good work. I have only a few comments:
* Section 1.1: Why is the cost function with the decomposition of S into YY' easier to optimise? And to what extent is the assumption of PSD matrices restrictive? I thought pairwise similarities are symmetric?
* Section 2: The parameter r is not introduced before it is used to describe B. Maybe mention it when introducing S = YY'?
* Typo in Section 4.2: "similarity similarities"
* Section 4.2: To what extent do the results depend on the choice of r = 3? This seems rather arbitrary.
* Section 4.2: The statistical test for non-random predictions is not described, and p-values are not given.
* Section 4.2: I am not familiar with the Chamfer distance and would appreciate a reference/definition.
* Typo in Figure 3: "LASSO (left)"
=====
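For context on the Chamfer distance that Reviewers #2 and #3 ask about: one common convention defines it between two point sets as the nearest-neighbour distance from each point in one set to the other, averaged in both directions. A minimal sketch under that convention (the point sets are illustrative; the paper may use a different variant):

```python
import math

def chamfer_distance(A, B):
    # Symmetric Chamfer distance between two point sets: for each point in
    # one set, the Euclidean distance to its nearest neighbour in the other
    # set, averaged per direction and summed over both directions.
    def one_way(P, Q):
        return sum(min(math.dist(p, q) for q in Q) for p in P) / len(P)
    return one_way(A, B) + one_way(B, A)

identical = chamfer_distance([(0.0, 0.0), (1.0, 0.0)],
                             [(0.0, 0.0), (1.0, 0.0)])  # identical sets
partial = chamfer_distance([(0.0, 0.0)],
                           [(0.0, 0.0), (3.0, 4.0)])    # one unmatched point
```

Because it compares whole sets without requiring a point-to-point correspondence, it is a plausible surrogate for similarity between representations of different sizes, which may be why the paper uses it in place of a self-reported perceptual measure.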