We thank all the reviewers for their helpful comments. To help address several of the reviewers' questions, we have included new timing results and a new synthetic experiment. We also briefly compare SQR models with hierarchical Bayesian models.

--- Computational Time (Reviewer_2, Reviewer_4) ---

In our submission, we used numerical integration to estimate the node-wise log partition function A(eta1, eta2). However, we discovered after our submission that the Exponential SQR node-conditional log partition function actually has a closed form:

A(eta1, eta2) = log[ sqrt(pi)*eta2*exp(-eta2^2/(4*eta1)) / (2*(-eta1)^(3/2)) * (1 - erf(-eta2/(2*sqrt(-eta1)))) - 1/eta1 ],

where erf(x) is the error function. This significantly improves the performance of our algorithm. We do not know whether other SQR models (such as the Poisson) have closed-form log partition functions, though we believe upper and lower bounds could be obtained in near-constant time. With this improvement, the time to fit the airline model in MATLAB was:

[Top 30 airports, p=30, n=365, time=27.95 s]
[273 airports, p=273, n=365, time=569.45 s]

We set the maximum number of gradient iterations to 1000 and computed the model on a computer with an Intel(R) Xeon(R) CPU E3-1271 v3 @ 3.60GHz (4 cores, 8 threads) and 16GB of memory. We used a MATLAB parfor loop to trivially parallelize the node-wise problems across 8 threads. These results suggest that model estimation is indeed feasible for large parameter matrices. In addition, these times could likely be improved significantly by implementing the algorithm in C++ and using a Newton-like optimization method as in [Inouye et al., 2015].

--- Synthetic Experiment (Reviewer_2) ---

Per Reviewer_2's suggestion, we implemented a simple synthetic experiment to help verify that the algorithm finds true patterns. We set p = 30 and theta = 0.
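(As a quick sanity check on the closed form above, here is a minimal Python sketch. It assumes the Exponential SQR node conditional density is proportional to exp(eta1*x + eta2*sqrt(x)) on x >= 0 with eta1 < 0, which is the integral the closed form evaluates; function names are ours, for illustration only.)

```python
import numpy as np
from scipy.special import erf
from scipy.integrate import quad

def log_partition_closed(eta1, eta2):
    # Closed form stated above; requires eta1 < 0.
    s = -eta1
    val = (np.sqrt(np.pi) * eta2 * np.exp(-eta2**2 / (4 * eta1))
           / (2 * s**1.5) * (1 - erf(-eta2 / (2 * np.sqrt(s))))
           - 1 / eta1)
    return np.log(val)

def log_partition_numeric(eta1, eta2):
    # A(eta1, eta2) = log of integral over [0, inf) of exp(eta1*x + eta2*sqrt(x)).
    val, _ = quad(lambda x: np.exp(eta1 * x + eta2 * np.sqrt(x)), 0, np.inf)
    return np.log(val)

for eta1, eta2 in [(-1.0, 0.5), (-2.0, -1.0), (-0.5, 1.5)]:
    assert abs(log_partition_closed(eta1, eta2)
               - log_partition_numeric(eta1, eta2)) < 1e-6
```

Note that with eta2 = 0 the closed form reduces to log(-1/eta1), the log partition function of a plain exponential distribution, as expected.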
We set Phi to a single simple cycle graph with positive parameters of 0.49 between neighboring nodes and -1 along the diagonal: Phi = [-1, 0.49, 0, ..., 0.49; 0.49, -1, 0.49, 0, ..., 0; 0, 0.49, -1, 0.49, 0, ..., 0; ...]. Phi is negative definite by construction (by the Gershgorin disc theorem, since each row's off-diagonal entries sum to 0.98 < 1), and there are a total of 30 edges in the node graph. Also, note that every node is positively correlated with every other node. We generated n = 100 samples from this distribution using Gibbs sampling, which we have developed since the original submission. Within the Gibbs sampler, we iteratively sample from the node conditional distributions using slice sampling. With lambda = 1e-05, the top 30 edges in the estimated Phi match the 30 true edges (i.e., a top-30 precision of 100%). These results suggest that our algorithm finds true patterns in the data, even though the model was estimated with only a small number of samples (n = 100).

--- Comparison to Hierarchical Bayesian Models (Reviewer_5) ---

The hierarchical Bayesian multivariate extension of univariate models (i.e., marginalizing out the hierarchical prior) can usually be written as:

Pr(x | theta) = \int Pr(mu | theta) \prod_i Pr(x_i | mu_i) dmu,

where Pr(mu | theta) is the prior distribution and the Pr(x_i | mu_i) are simple univariate distributions. As pointed out by Reviewer_5, if the prior has dependency structure, then some of this structure is passed on to the marginal distribution of x. The most common correlated prior distributions are the multivariate normal distribution and, when mu_i must be positive, the multivariate log-normal distribution. (On a side note, SQR models could be used as priors in place of these standard distributions in hierarchical models---e.g., an Exponential SQR instead of a log-normal prior.) One advantage of this generative formulation is that exact sampling is possible.
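(To illustrate this exact, ancestral sampling and how prior dependency passes to the marginal of x, here is a minimal Python sketch. The setup is purely illustrative and not from our paper: a multivariate normal prior on mu with pairwise correlation 0.8, and conditionally independent exponential observations x_i | mu_i.)

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 5, 100_000

# Hypothetical correlated prior: mu ~ MVN(0, Sigma) with pairwise correlation 0.8.
Sigma = 0.8 * np.ones((p, p)) + 0.2 * np.eye(p)
mu = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

# Conditionally independent observations: x_i | mu_i ~ Exponential(scale = exp(mu_i)).
# Sampling mu first and then x given mu is exact (ancestral) sampling.
x = rng.exponential(scale=np.exp(mu))

# Dependency in the prior is passed on to the marginal distribution of x:
corr = np.corrcoef(x, rowvar=False)
assert corr[0, 1] > 0.1  # clearly positive marginal correlation
```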
Another possible advantage is that the mean or covariance of the final distribution may be known in closed form (often related to the mean and covariance of the prior). However, the integral form is cumbersome to handle and cannot usually be simplified. In terms of estimation, exact inference in hierarchical models is often intractable due to the presence of latent variables, so approximate variational inference or MCMC methods are typically used instead. In contrast, SQR models likely enjoy theoretical guarantees similar to those of previous node-conditional models [Yang et al., 2015, Tansey et al., 2015], and the node-wise regressions can harness the power of convex optimization algorithms. Thus, parameter estimation is likely more efficient both theoretically and practically. Further comparison would be an interesting area to explore.
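(For completeness, the cycle-structured Phi from the synthetic experiment above can be constructed and sanity-checked in a few lines; this Python sketch simply restates the Gershgorin argument and the edge count given in the text.)

```python
import numpy as np

p = 30
Phi = -np.eye(p)
for i in range(p):
    Phi[i, (i + 1) % p] = 0.49  # positive parameter between cycle neighbors
    Phi[(i + 1) % p, i] = 0.49

# Gershgorin: every diagonal entry is -1 and each row's off-diagonal entries
# sum to 0.98 < 1, so all eigenvalues lie in [-1.98, -0.02] (negative definite).
assert np.linalg.eigvalsh(Phi).max() < 0

# A simple cycle on 30 nodes has exactly 30 edges.
assert np.count_nonzero(np.triu(Phi, 1)) == 30
```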