We thank the reviewers for the time they devoted to providing thoughtful comments and questions. We will address the comments in the final version of the paper, and we attempt to answer the questions below.

Reviewer 2: The main reason that MLE-SPN performs worse than CVB-SPN is that MLE-SPN applies projected gradient descent directly in the parameter space, so each intermediate estimate during optimization is only a local update away from the previous one. CVB-SPN, by contrast, takes local steps in its hyperparameters (the parameters of the Dirichlets) but can take much larger steps in the model parameters: line 2 of Alg. 1 sets the current parameters to their posterior mean. Put another way, MLE-SPN is a local search method in parameter space, whereas CVB-SPN searches locally over hyperparameters but always uses the global posterior mean for the model parameters.

On the comparison with Gens' paper: the scores reported there are for combined structure and parameter learning, not for parameter learning alone as in our results. In Gens' paper the model parameters are obtained during the structure-learning process: recursive clustering is used to build each node of the SPN, and the weights are then set according to the sizes of the clusters. The resulting scores are not directly comparable with ours, because their method is not applicable when an SPN with a fixed structure is given, or when the data set used to build the structure differs from the data set used to learn the parameters.

The results reported in Table 2 do not differ much across multiple runs with random initialization; we reported the average over runs. In the final version we will report the variance over runs, in addition to the Wilcoxon rank test already reported in Table 2.

The purpose of scaling the weights returned by LearnSPN is to determine how strongly the prior is concentrated around the LearnSPN solution. LearnSPN does not tell us the appropriate concentration, so we need to pick it by a procedure external to LearnSPN. We used a single scaling constant fixed across all 20 datasets, chosen by coarse optimization on a subset of the data.

It is a good point that including the prior in CVB-SPN makes a fair comparison with MLE-SPN more difficult. To shed additional light on this comparison, we performed additional experiments (which we will report in the final paper if accepted). We derived a gradient-based MAP-SPN update that includes the same prior information used by CVB-SPN; the derivation is not immediately obvious because of the lack of conjugacy, but we can show that the MAP update corresponds to adding a regularization term that encourages the gradient-based update to point toward the normalized pseudo-count vector of the Dirichlet prior (schematic forms of the three updates are sketched at the end of this response to Reviewer 2). We then tested MAP-SPN on the nltcs, jester, netflix, and plants datasets. On three of the four datasets MAP-SPN gives slightly worse results than MLE-SPN, and on the fourth it is significantly worse, so CVB-SPN beats both MLE-SPN and MAP-SPN. We conjecture that the poor performance of MAP-SPN is due to the nonconvexity of the objective function: even though the normalized pseudo-count vector of the prior is a good solution globally, it does not necessarily yield a good search direction locally.

\tau_S is the number of unique trees embedded in the SPN S, and it is raised to the power D because there are D instances in the training set. We will make all of our code publicly available if the paper is accepted.
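To make the contrast concrete, the following are schematic forms of the three updates discussed above; the step size \eta, the regularization weight \lambda, and the per-sum-node simplex projection \Pi_\Delta are notation introduced only for this sketch, and the exact updates are those given in the paper.

  MLE-SPN (projected gradient, local in w):  w^{(t+1)} = \Pi_\Delta\big( w^{(t)} + \eta \, \nabla_w \log p(D \mid w) \big)
  CVB-SPN (line 2 of Alg. 1, Dirichlet posterior mean):  w_{ij} \leftarrow \alpha_{ij} / \sum_{j'} \alpha_{ij'}
  MAP-SPN (gradient step regularized toward the normalized pseudo-counts \bar\alpha_{ij} = \alpha_{ij} / \sum_{j'} \alpha_{ij'}):  w^{(t+1)} = \Pi_\Delta\big( w^{(t)} + \eta \, [\, \nabla_w \log p(D \mid w) + \lambda (\bar\alpha - w^{(t)}) \,] \big)

In this form, the MAP-SPN step only nudges w^{(t)} toward \bar\alpha locally, which is consistent with our conjecture that a globally good \bar\alpha need not provide a good local search direction.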
Reviewer 3: Unfortunately we cannot directly compare against the non-collapsed version: it is too expensive to run on our examples. For instance, running the non-collapsed variational inference approach even on a data set of moderate size, such as WEBKB, requires around 160.9 GB of memory. We should (and will) mention this in the experiments section; our current discussion of intractability is in the first paragraph of Sec. 3.1.

Reviewer 4: We will work to improve the motivation in the paper. We agree that the two main advantages of our method are that it takes advantage of analytic marginal inference in SPNs and that it yields a lower bound even in the absence of conjugacy. Compared with recent reparameterization and score-function gradient methods, our proposed method does not require sampling from the approximate posterior distribution. Those methods therefore face a tradeoff: with many samples they are computationally expensive, while with few samples the resulting gradients are noisy, which makes optimization difficult. This difficulty can be mitigated, but not eliminated, by variance-reduction techniques. So for SPNs, where exact marginalization is possible, our method is both computationally and statistically more efficient.
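As an illustration of this tradeoff (in generic notation, not the notation of the paper), the score-function estimator of the gradient of an expectation under an approximate posterior q_\phi is

  \nabla_\phi \, \mathbb{E}_{q_\phi(w)}[ f(w) ] \approx \frac{1}{S} \sum_{s=1}^{S} f(w^{(s)}) \, \nabla_\phi \log q_\phi(w^{(s)}),  with  w^{(s)} \sim q_\phi.

A small number of samples S gives cheap but noisy gradient estimates, while a large S reduces the noise at proportionally higher cost. In our approach the corresponding expectations are computed in closed form via exact SPN marginalization, so no sampling is needed.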