Paper ID: 616
Title: Collapsed Variational Inference for Sum-Product Networks

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes a collapsed variational inference algorithm for sum-product networks.

Clarity - Justification:
Reasonably clear. Some of the larger equations could be simplified to improve the flow.

Significance - Justification:
This paper presents a new parameter learning algorithm for sum-product networks using collapsed variational inference. The idea appears to be novel, and the authors found a workaround for the intractable lower bound.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
This paper presents a new parameter learning algorithm for sum-product networks using collapsed variational inference. The idea appears to be novel, and the authors found a workaround for the intractable lower bound. The experimental results do seem to show that collapsed VB has an edge over some other baselines. It would be nice to see how a non-collapsed version compares with the collapsed one.

=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper introduces a collapsed variational inference approach for estimating the parameters of SPNs. In contrast to "classical" variational inference approaches, the proposed approach models the marginal posterior only over the global latent variables (and not over the local ones) and is thus computationally more efficient (a sketch of this construction is given after the reviews).

Clarity - Justification:
The paper is mostly well written and easy to follow.

Significance - Justification:
To the best of my knowledge, this paper presents one of the first Bayesian approaches to learning SPNs.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I am happy with the derivation of the collapsed variational inference and also like the general theme of the paper. However, the experiments are not that clear to me:

* Why are the reported likelihoods in Table 2 so bad for MLE-SPN? Is it because of using random initial weights? The performance reported in Gens's structure learning paper is much better than the one shown here (and also much closer to the performance of your collapsed inference approach). It is thus hard to judge the real advantage of your method. Please clarify this.
* How much do results differ across multiple runs with random initializations? Are the results averaged over multiple runs for every dataset? Please indicate the variance of the results in Table 2.
* In line 742 you say that you use the weights returned by LearnSPN, scaled by a positive constant, as a prior for the parameters. What is the purpose of the scaling? Do I understand correctly that you only provide this type of knowledge to your learning algorithm (but not to MLE-SPN)?

Minor comments:
* Line 080 -- what does CVB stand for (collapsed variational Bayes)? Maybe spell this out once.
* Line 207 -- descendant?
* Line 380 -- the AISTATS paper "On Theoretical Properties of Sum-Product Networks" by Peharz et al. may be a better reference here.
* Line 411 -- what is \tau_S^D?
* Line 392 -- You use the notion of d-separation from Bayesian networks. Maybe spend a few more words explaining why it is OK to use it here.
* Line 459 -- establishes => holds?
* Line 507 -- is => being?
* In line 642 you refer to Peharz's dissertation regarding maximum-likelihood-based parameter learning approaches. It seems that you mainly consider projected gradient descent and also use this approach in the experiments. I thus think the original paper by Poon & Domingos or the LearnSPN paper by Gens & Domingos would be better references here.
* Line 703 -- the numerical issues => numerical issues
* Robert Peharz's dissertation was not conducted at Aalborg University but at Graz University of Technology.

=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors present a variational inference algorithm for Bayesian estimation of the parameters of Sum-Product Networks. The algorithm collapses out the local variables to take advantage of the computationally efficient marginalization SPNs provide. The authors address the lack of conjugacy induced by marginalization through a transformation that admits a lower bound. The experiments show compelling evidence for this approach.

Clarity - Justification:
The paper is easy to follow, but some of the motivation for the approach is flawed.

Significance - Justification:
Sum-product networks are an important direction in probabilistic modeling, so parameter learning for them is important.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I think that with a little finesse in the writing, this paper would be much cleaner. Some concerns:

1) The control variates used in (Ranganath et al., 2014) and (Mnih & Gregor, 2014; there called learning signals) are not model specific. Only Paisley, Blei, and Jordan (2012) develop control variates based on the model.
2) I think the computational complexity for gradients would be better served by dropping D, as any non-stochastic gradient over global parameters iterates over the whole dataset.

I think a better motivation would be to say that while non-conjugate inference with little analytic work is possible with recent innovations (cf. lines 119-121 of your paper), those methods fail to exploit a key analytic property of sum-product networks, easy marginalization, which allows the optimal variational distributions of the local variables to be used. A sentence comparing the extra lower bound with using something like reparameterization or score-function gradients on the collapsed bound would be nice, as these modern techniques remove the need for a further lower bound (a minimal sketch of such an estimator appears after the reviews).

=====
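Sketch of the collapsed bound referred to in Reviews #2 and #3. This is illustrative only; the notation (global sum weights w, local mixture assignments z_n, observations x_n, variational posterior q(w)) is introduced here and need not match the paper's. The standard mean-field bound keeps a variational factor for every local variable:

  \log p(X) \;\ge\; \mathbb{E}_{q(w)\,q(Z)}\!\left[\log p(X, Z, w) - \log q(Z) - \log q(w)\right],

whereas the collapsed bound sums out each z_n exactly, which for an SPN is a single upward pass per data point:

  \log p(X) \;\ge\; \mathbb{E}_{q(w)}\!\left[\sum_{n=1}^{N} \log \sum_{z_n} p(x_n, z_n \mid w)\right] - \mathrm{KL}\!\left(q(w)\,\|\,p(w)\right).

The inner sums are cheap for an SPN, but the expectation under q(w) of the log of a sum is generally not available in closed form; this is the non-conjugacy the reviews mention, which the paper handles with a further lower bound.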
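Sketch of the score-function alternative suggested in Review #3. Again illustrative and not the paper's method: it assumes a log-normal q(w) over unnormalized global weights (ignoring the simplex constraint on sum-node weights) and a hypothetical black-box log_joint_collapsed(w, X) that returns log p(w) + sum_n log sum_{z_n} p(x_n, z_n | w) via the SPN's upward pass.

  import numpy as np

  def score_function_gradient(log_joint_collapsed, X, mu, log_sigma, n_samples=64):
      """Monte Carlo (REINFORCE) gradient of the collapsed ELBO
      E_q[log p(X, w)] - E_q[log q(w)] with respect to the parameters
      (mu, log_sigma) of a log-normal q(w); no extra lower bound is needed."""
      sigma = np.exp(log_sigma)
      grad_mu = np.zeros_like(mu)
      grad_ls = np.zeros_like(log_sigma)
      for _ in range(n_samples):
          eps = np.random.randn(*mu.shape)
          w = np.exp(mu + sigma * eps)                  # draw global weights from q(w)
          # log q(w) for a log-normal, up to an additive constant
          log_q = -0.5 * np.sum(eps ** 2) - np.sum(log_sigma) - np.sum(np.log(w))
          f = log_joint_collapsed(w, X) - log_q         # ELBO integrand at this sample
          # score d log q(w) / d(mu, log_sigma), with the sampled w held fixed
          grad_mu += f * (eps / sigma)
          grad_ls += f * (eps ** 2 - 1.0)
      return grad_mu / n_samples, grad_ls / n_samples

Because the estimator only needs evaluations of log_joint_collapsed (not gradients through the SPN), it sidesteps the additional lower bound; in practice it would require variance reduction, e.g. the control variates discussed in point 1 of Review #3.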