We thank the reviewers for their helpful feedback, and we will fix all typos noted in the reviews. Below we address the points raised in each review.

Reviewer 1: The squared formulation (1) achieves a low-rank W via a nuclear-norm penalty, which is computationally impractical. The square-root formulation (2) can be posed so that the solution is inherently low rank, bypassing the nuclear norm. The rank of the approximation is chosen as the smallest r that retains at least 85% of the similarity structure, i.e., ||S - YY'||_F / ||S||_F < 0.15; this was a typo in the paper and will be fixed. We provide a trick in the supplementary material to drop the PSD assumption: the symmetric similarity matrix can be written as S = YDY', where the diagonal entries of D are the signs of the eigenvalues of S and Y contains appropriately scaled eigenvectors as columns. The final solution can then be written as W = BDB', where B is the solution of the optimization in (2); a minimal sketch of this construction is given immediately below. Regarding Chamfer matching, please see the response to the similar question from Reviewer 3 below. Our assertion that both approaches show significantly non-random predictions is based on a permutation-based paired t-test, where chance corresponds to a difference of zero. By this measure, for example, GrOWL-II's performance is significantly better than chance (t = 8.59, p < 0.0001). We will add this explanation. Image quality will be improved for any final manuscript.
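For concreteness, here is a minimal sketch of the rank selection and the sign-matrix trick described above, assuming a symmetric similarity matrix S and a NumPy-style eigendecomposition; the function and variable names are illustrative and not the paper's implementation.

```python
import numpy as np

def low_rank_factor(S, tol=0.15):
    # Factor a symmetric (not necessarily PSD) similarity matrix as S ~= Y D Y',
    # where D is diagonal with entries in {+1, -1}, keeping the smallest rank r
    # such that ||S - Y D Y'||_F / ||S||_F < tol (tol = 0.15 retains at least
    # 85% of the similarity structure).
    lam, V = np.linalg.eigh(S)                  # eigenvalues and eigenvectors of S
    order = np.argsort(-np.abs(lam))            # sort by magnitude, largest first
    lam, V = lam[order], V[:, order]
    fro = np.linalg.norm(lam)                   # ||S||_F equals the l2 norm of the eigenvalues
    r = len(lam)
    for k in range(len(lam) + 1):
        # Frobenius error of dropping every eigenpair beyond the first k
        if np.sqrt(np.sum(lam[k:] ** 2)) / fro < tol:
            r = k
            break
    Y = V[:, :r] * np.sqrt(np.abs(lam[:r]))     # eigenvectors scaled by sqrt(|eigenvalue|)
    D = np.diag(np.sign(lam[:r]))               # signs of the retained eigenvalues
    return Y, D
```

The optimization in (2) is then run with Y as the target factor, and the final estimate is recovered as W = B D B'.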
Reviewer 2: RSA differs from NRSA in two respects: 1) RSA is restricted to local neighborhoods; 2) RSA considers the (unweighted) correlations between voxels in each local neighborhood. In contrast, NRSA is non-local and learns a weighting matrix W that reflects differing levels of interaction between voxels. We agree with the reviewer that the left of the ROC graph is more practically relevant, and it is clear that GrOWL does much better than Group Lasso in this region of the curves (Fig. 4). We agree that more work is required to compose visualizations that provide clear insights to neuroscientists.

Reviewer 3: The key difference between sparse multi-task and single-task regression is the nature of the sparsity patterns: group-structured sparsity (multi-task) vs. standard sparsity (single-task). Proximal point algorithms are designed for the specific form of sparsity, and thus differ in the multi-task setting. The derivation of GrOWL and its proximal point algorithm is novel. Chamfer distance [1] measures pixel-wise differences between binary bitmaps; a minimal sketch of one common formulation is given after the references. The human visual system is organized hierarchically, with the simplest features coded in a retinotopic map in the posterior and medial occipital cortex (V1) [2]. We expected that the visual structure expressed by Chamfer distance would be related to the low-level visual structure represented in V1, which is different from subjectively perceived visual similarity. Consistent with this expectation, our method selects many voxels in and around V1. The problem we address differs from usual brain network analysis in that we are not interested in correlations in general; instead, we seek a network of voxels with correlations that match a specific set of target similarities. We are not aware of prior work that deals with this problem, apart from the RSA approach, which does not allow for distributed and weighted network representations. The simulation study depicted in Fig. 3 shows how GrOWL is able to identify subnetworks that interact to encode representations of similarity information. The real-data experiments depicted in Fig. 5 also show interacting subnetworks (brain areas). Regarding other applications, we note that GrOWL is appropriate for general multi-task regression problems involving strongly correlated covariates; such problems routinely arise in bioinformatics and other areas. The chemistry application mentioned in the introduction is part of an ongoing collaboration with researchers in chemistry education. Neural coding is thought to often be highly redundant. GrOWL allows models with redundancy, whereas Group Lasso discourages the selection of redundant voxels. Our simulation study (Fig. 3) shows that the GrOWL voxel selections are more stable than the Group Lasso selections (brighter colors indicate greater stability of the selections across trials). Although not discussed in the paper due to space limitations, we also observed this in the fMRI data.

[1] Borgefors, G. (1988) Hierarchical chamfer matching: A parametric edge matching algorithm. IEEE Trans. Pattern Anal. Mach. Intell., 10(6), 849-865.
[2] Kriegeskorte, N. & Kievit, R.A. (2013) Representational geometry: integrating cognition, computation, and the brain. Trends in Cognitive Sciences, 17(8), 401-412.
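Sketch referenced in the response to Reviewer 3: one common formulation of the Chamfer distance between two binary edge maps, computed with a Euclidean distance transform. The helper name and the use of SciPy are our assumptions for illustration; this is not the exact implementation used in the paper or in [1], which uses integer chamfer weights and a hierarchical search.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def chamfer_distance(a, b):
    # a, b: 2-D boolean arrays marking the "on" (edge) pixels of two binary bitmaps.
    # distance_transform_edt(~b) gives, at every pixel, the Euclidean distance to
    # the nearest on-pixel of b (and 0 at b's own on-pixels).
    dist_to_b = distance_transform_edt(~b)
    dist_to_a = distance_transform_edt(~a)
    # Average nearest-neighbor distance in both directions, then symmetrize.
    return 0.5 * (dist_to_b[a].mean() + dist_to_a[b].mean())
```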