Reviewer 1:$*"One weakness of the paper is how widely applicable this formulation is" The proposed method focuses on the "localized" structure of an MN, since it requires much less information. It can be applied whenever such a local structure is wanted, but "global information" is unavailable. "How does the algorithm perform when the full \Theta is sparse? " This is an interesting question. The proposed method is a "local method" that *directly* targets on learning the between-group graphical structure. If links between two groups of random variables are the only connections in this graphical model, then the methods learning the global structure will become a *direct* method, because in this special case, the global structure and the between-group structure are equivalent. Under this circumstance, both methods should achieve a comparable performance. However, in the targeted applications, such as bipartisanship analysis or clustered gene link discovery, the overall sparsity is a very strong assumption and the proposed method can work efficiently *without* such an assumption. *"It would be nice to see the results from competing algorithms on the real data sets in the Empirical section" Thanks for the suggestion. We will add some comparisons in a revised version. For the 106th senate voting dataset, please refer to Figure 16 in Banerjee et al., (2008) where a global graphical model structure was learned. The resulting graph is very dense within each political party and there are only a few between-party links. This to some extent verifies our assumption: The overall graph is dense while the between-cluster links are scarce. Both methods discovered the bipartisanship related to Ben Nelson and Lincoln Chafee. Moreover, the proposed method found a group of senators seem to "disagree" with each other (the small blue cluster in Fig. 5) while no such structure was found in Banerjee et al., (2008). Reviewer 2. *“It is not clear how the experiments were performed to compare the proposed method with the previous methods in Sec 6.1.” Sorry that we have omitted this due to the length limit. The differential learning method introduced in Section 6.1 learns the difference between two precision matrices. In our setting, if one can learn the difference between the precision matrices of p(x) and p(x1)p(x2), one can figure out all edges that go across two groups (x1 and x2). This method requires sample covariance matrices of p(x) and p(x1)p(x2) respectively. The sample covariance of p(x) is easy to compute given joint samples. However, to obtain the sample covariance of p(x1)p(x2), we would again need the U-statistics introduced in line 485-487. We may approximate the u,v-th element of the covariance matrix of p(x1)p(x2) using the following formula: \Sigma_{u,v} = 1/{n \choose 2} \sum_{j \neq k} x_v^{[j,k]}x_u^{[j,k]}, assuming p(x) is zero mean. The above is again a two sample U-statistics (Hoeffding, 1963). After computing the sample covariances, we may plug them into the objective function (2) in Zhao et al., 2014 and estimate the differences between precision matrices. The details above will be included in the supplementary material. *"While the results in Sec 6.2 are interesting, a "quantitative" comparison with other approaches on real datasets would be valuable." Thanks for the suggestion, and we will introduce such a comparison in a revision. However, the real-world data used in this paper does not have a gold standard which may make the quantitative comparison a bit difficult. 
Reviewer 3:

*"Specifically, I am not sure that the extended discussion of the Hammersley-Clifford theory (and respective generalization for the partitioned case) is critical"
The extension of the Hammersley-Clifford theory (Sections 2 & 3) is introduced for two reasons:
1. As the reviewer mentioned, it motivates our learning algorithm: if one only wants the PMN structure, one should avoid learning a joint density, which encodes *all the edges* of the global structure.
2. It also serves as the theoretical foundation of our learning algorithm: without such a factorization theory, we would not know how to relate the sparse factorization of the partitioned ratio to the PMN structure. There is no known theory for that, and the correspondence between the factorization and the PMN structure is not straightforward.
In other words, Section 2 motivates us to explore a learning algorithm different from the existing maximum likelihood estimation, and the factorization theorems in Section 3 tell us how such an algorithm should work.

==== Typo Corrections:

Theorem 5:
> "Successful Passage Recovery: \hat{S} = S^* and \hat{S}^c = S^{c*}"
should be "Successful Passage Recovery: \hat{S} = S and \hat{S}^c = S^{c}".

Lines 585-586:
> \max_{t\in S}
should be \max_{t\in S \cup S^c}.