Paper ID: 881
Title: The Sum-Product Theorem: A Foundation for Learning Tractable Models

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): The paper describes a unifying framework for learning tractable models for a variety of problems, including optimization, satisfiability, constraint programming, and inference. The unification is achieved by considering sum-product functions over a semiring of sum operators and product operators. When the product operator is decomposable, it is shown that any problem that boils down to summing out the variables of the sum-product function can be solved in time linear in the size of the representation of the sum-product function. Furthermore, a generic learning technique is proposed to learn the sum-product function from data.

Clarity - Justification: The paper is well written. The only thing that was not clear to me is how to interpret the "argsum" operation. Argmax makes sense, but how should I interpret argsum?

Significance - Justification: The paper contributes a significant advance that will impact a broad range of fields. Based on this work, we may be able to circumvent NP-hardness (and higher forms of complexity) in many applications by learning tractable problem specifications from data instead of manually specifying intractable problems. Since any manual problem specification is necessarily approximate, due to the limited knowledge of the domain expert, we may as well shoot for a tractable specification and learn it from data. This is a new paradigm that could revolutionize a good chunk of computer science.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): It is clear in my head how to learn sum-product functions to represent distributions for inference, but I am not sure how the proposed generic algorithm can be used to learn other types of problems (e.g., SAT, CSP, etc.).
Is it possible to give some examples, perhaps in the supplementary material? NB: I haven't checked the supplementary material yet, so if there are already examples, then ignore this question.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): The paper introduces a unifying view of many inference problems, including inference in graphical models, satisfiability, constraint satisfaction, and certain types of optimization, in terms of operations on semirings. The paper introduces some conditions for these operations to be tractable (borrowing ideas from the knowledge compilation literature), and a simple learning algorithm to learn these tractable representations from data. The authors provide a unifying framework for several existing results on the tractability of inference tasks, and demonstrate an application in which they learn a non-convex function that has a decomposable structure and can therefore be optimized in polynomial time.

Clarity - Justification: The paper is well written and structured. However, the notation is dense and the paper is hard to read and follow; I assume this is partially because of lack of space. A few figures would have been helpful. I think it would also be helpful to clearly indicate what is new/novel and what is known. For example, several of the corollaries (e.g., Corollary 4) are known.

Minor:
- Ch() seems to be undefined in Definition 5.
- It seems that, according to the definition, two nodes that do not share variables are compatible.
- When you say "children of a sum node sij are all compatible product nodes ci", does this create connections with product nodes ci that are "far away" in the junction tree (i.e., that do not share variables with sij)?

Significance - Justification: Even though the results are not particularly deep or surprising (some are in fact well known), the paper provides a nice overview and draws interesting connections.
The main contribution I see is the unifying view and the fact that many of the ideas presented are novel/interesting for the ICML audience.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): My main concern is with the novelty of the results. The part on learning these representations appears to be novel (although very similar to the corresponding approach for SPNs). It seems that many of the results presented on inference/tractability appeared in a similar form in the Kimmig et al. (2012) technical report. It would be very helpful to provide a more in-depth comparison with that work. The authors say that the results there are exponentially weaker, but it is not entirely clear which results they refer to. It would be great if you could provide more perspective on that, especially because the main contribution of the paper seems to be theoretical. The part on learning non-convex functions appears to be a novel extension of the work of Friesen et al., but the experimental results provided are an interesting proof of concept rather than a real application of these ideas.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): Given a semiring function F, the paper states sufficient conditions (called decomposability) under which one can sum all values of such a function (for any input) in linear time. A generic outline of an algorithm for learning F is given. The paper then discusses how specific tasks are instances of this semiring problem and how optimization functions can be learned from data.

Clarity - Justification: The presentation quality is excellent. Reproducibility is not an issue, except for the small experiment in Section 6, whose setting is contrived and may not be reproducible ("LearnSPF did not learn the leaf functions ... instead ...").
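The linear-time summation the reviews summarize can be made concrete with a small sketch. The following is not the paper's algorithm; it is an illustrative bottom-up pass over a decomposable circuit for a generic commutative semiring, with all names and the toy circuit invented for illustration.

```python
# Illustrative sketch (not from the paper): summing out all variables of a
# sum-product function (SPF) given as a DAG, in one bottom-up pass over the
# circuit -- a linear number of semiring operations in the circuit size.

def sum_spf(node, plus, times, leaf_sum, cache=None):
    """Sum the SPF rooted at `node` over all variable assignments.

    node: ('leaf', f)        -- leaf_sum(f) returns the leaf summed over
                                its own variable(s)
          ('sum', children)  -- combine children with semiring addition
          ('prod', children) -- decomposability: children mention disjoint
                                variables, so their sums simply multiply
    """
    if cache is None:
        cache = {}            # memoize shared sub-circuits of the DAG
    key = id(node)
    if key in cache:
        return cache[key]
    kind = node[0]
    if kind == 'leaf':
        val = leaf_sum(node[1])
    else:
        vals = [sum_spf(c, plus, times, leaf_sum, cache) for c in node[1]]
        op = plus if kind == 'sum' else times
        val = vals[0]
        for v in vals[1:]:
            val = op(val, v)
    cache[key] = val
    return val

# Toy example: counting the 4 assignments of two independent Boolean
# variables; each leaf, summed over its own variable, contributes 2.
circuit = ('prod', [('leaf', 'x1'), ('leaf', 'x2')])
count = sum_spf(circuit,
                plus=lambda a, b: a + b,
                times=lambda a, b: a * b,
                leaf_sum=lambda f: 2)
```

With the counting semiring this computes a model count; swapping in other (plus, times) pairs would reuse the same traversal for other tasks.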
Significance - Justification: The paper is essentially a collection of well-known results and already-established connections cast into new jargon ("sum-product theorem").

First, one should know that sum-product networks (SPNs) are an attempt to repackage the literature on arithmetic circuits (ACs). ACs have been studied for two decades, in the context of probabilistic graphical models, and sometimes in the context of Boolean logic, where they are called NNF circuits. The fact that SPNs are no different from ACs is shown precisely by Rooshenas and Lowd in their ICML-14 paper [1], which describes a linear-time transformation from one syntax into the other. The distinction drawn in this paper, that SPFs (and SPNs) have function leaves instead of the variable leaves of ACs, is a smoke screen: continuous function leaves are easily combined with discrete AC variable leaves by setting soft evidence/weighted inputs, which is exactly what happens in SPFs.

Second, SPFs are simply a semiring generalization of ACs. This generalized perspective on ACs is in fact quite common. For example, to find most-probable variable assignments in ACs, the textbook algorithm [2, Section 12.3.2] replaces summation by maximization. This semiring perspective is studied in detail by Kimmig et al. (2012) using NNF circuit terminology. Hence the question arises: does the paper present significant contributions in this semiring space? The main contribution, the sum-product Theorem 1, is identical to Theorem 2 of Kimmig et al., where summing an SPF is referred to as the AMC computation, and sd-DNNF stands for smooth deterministic decomposable NNF circuits. As is clear from the translation of Rooshenas and Lowd, an SPF can be translated in linear time into a smooth deterministic NNF or AC. Hence, the property required by Kimmig's Theorem 2 is decomposability, which is also the property required by Theorem 1 in the paper under review.
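The semiring swap the review points to (replacing summation by maximization to obtain most-probable-assignment values) can be sketched as follows. The circuit and weights here are invented for illustration; this is a toy evaluator, not the textbook algorithm itself.

```python
# Illustrative sketch: the same circuit evaluated under two semirings.
# (sum, product) yields a marginal/partition-style value; (max, product)
# yields the value of the most probable assignment. Circuit and weights
# are made up for this example.

def evaluate(node, plus, times, weights):
    kind = node[0]
    if kind == 'leaf':
        return weights[node[1]]      # semiring value attached to the leaf
    vals = [evaluate(c, plus, times, weights) for c in node[1]]
    op = plus if kind == 'sum' else times
    out = vals[0]
    for v in vals[1:]:
        out = op(out, v)
    return out

# Tiny circuit: (x=0 | x=1) * (y=0 | y=1), over independent variables x, y.
circuit = ('prod', [
    ('sum', [('leaf', 'x0'), ('leaf', 'x1')]),
    ('sum', [('leaf', 'y0'), ('leaf', 'y1')]),
])
w = {'x0': 0.3, 'x1': 0.7, 'y0': 0.6, 'y1': 0.4}

partition = evaluate(circuit, lambda a, b: a + b, lambda a, b: a * b, w)
mpe_value = evaluate(circuit, max, lambda a, b: a * b, w)
```

Here the sum-product pass gives (0.3 + 0.7) * (0.6 + 0.4) = 1.0, while the max-product pass gives max(0.3, 0.7) * max(0.6, 0.4) = 0.42; only the addition operator changes between the two tasks.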
After filtering out the noise of new terminology, it is clear that the sum-product theorem is not a novel contribution. I will go into less detail about the significance and novelty of the remainder of the paper. Let me just note:
- The SPF construction from a junction tree outlined on page 3 is a textbook AC construction algorithm [2, Section 12.4.2].
- Section 3 is a DPLL-based compilation algorithm, which is standard for AC construction. It is presented at such a high level that it becomes insignificant. Real AC compilation algorithms have this outline, but the devil is in the details (caching schemes, clause learning, backjumping, restarts, etc.).
- Section 4 is identical to the algorithm of Gens and Domingos (2013) for SPNs. Again, the generalization is there, yet what is presented is too straightforward to be of interest. The significant innovations here would be the semiring-specific ways of implementing lines 5 and 9 and making this work in practice. These issues are not investigated, except for the rather contrived initial experiment in Section 6.
- Section 5 does not seem to contain any novel observations beyond the union of Kimmig et al. (2012) and Friesen and Domingos (2015).

[1] Rooshenas, Amirmohammad, and Daniel Lowd. Learning sum-product networks with direct and indirect variable interactions. Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014.
[2] Darwiche, Adnan. Modeling and reasoning with Bayesian networks. Cambridge University Press, 2009.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): The novelty and significance issues aside, the main sum-product theorem is incorrect. The Binary Decision Diagram (BDD) semiring over the set of all BDDs has conjunction of BDDs for multiplication and disjunction for addition. Evaluating an SPF using this semiring amounts to transforming a decomposable NNF circuit into an equivalent BDD.
It is well known that this incurs an exponential blow-up (see Darwiche and Marquis, 2002). Summing an SPF can therefore not be done in linear time, as claimed in Theorem 1.

The main point of lines 212-318 (including Definitions 4, 5, 6, Corollaries 1, 2, and various embedded definitions) is this: there are functions that do not exhibit enough variable-based independence to admit junction tree representations of bounded treewidth, yet these functions can be represented by decomposable circuits of linear size. This has been well known for a very long time in basically all instances discussed in the paper (probabilistic reasoning, model counting, SAT, CSP, etc.). The paper talks about "formalizing" this known observation, but the long and tedious formalization amounts to eventually providing an example function that exhibits this property (in the proof of Corollary 2). This tells us nothing new as far as implications or insights go.

The paper incorrectly suggests that SPNs/SPFs contribute the insight that tractable inference can be done on high-treewidth models. It is not true that inference in graphical models has to be exponential in treewidth; this insight is decades old. For example, the UAI'08 competition had many Bayesian networks with very high treewidth that were solved exactly by the winning solver. Hence, we already know that exact inference in graphical models is not necessarily exponential in treewidth.

The authors seem aware of the novelty issues with their work, and are over-compensating.
- For example, lines 302-310 claim that the paper's results are exponentially stronger than those given elsewhere, or more general, or that some results have not been shown before. These statements are not substantiated. What does it even mean for these results to be exponentially stronger? SPNs and ACs have a linear-time transformation. I see no new results that were unknown, or less generally stated, for ACs or NNFs.
- For example, line 376 claims that the SumSPF algorithm is more general and semiring-independent. This is again not really different from AC compilation algorithms. Indeed, people compile graphical models into ACs with the goal of computing both marginals and MPE, which are two separate semirings; thus compilation is already semiring-independent.
- For example, line 627 claims that LearnSPF is more attractive than other algorithms. Why? LearnSPF is severely underspecified. How can one even compare it to a real learner, e.g., one based on ILP?
- For example, lines 692-694 state that SumSPF generalizes all UAI inference algorithms and AC compilers. These are entire research areas with their own conferences and workshops. To say that a 25-line algorithm subsumes all this research is facetious.

The first sentence in Section 4 should also reference at least the Lowd and Domingos UAI'08 paper, which can be viewed as the starting point for learning tractable models in the form of arithmetic circuits from data.

The Supplementary Material takes even more shortcuts in its representation of the literature:
- The idempotency observation on line 027 is due to Kimmig et al. (2012).
- "SPNs and ACs are equivalent, but that SPNs are always smaller or equal in size": this is misleading. The transformation is linear, and the smaller-size claim depends on the definition of size.
- "Existing compilation methods require first encoding the graphical model in very restrictive languages (such as CNF or SDDs)": wrong; one can compile an SDD from arbitrary circuit inputs.
- The corollary on line 155 is not true for the standard definition of MPE, which does not include the hidden variables.

Line 096 of the introduction states that "In flat representations like graphical models and conjunctive normal form, consisting of a single product of sums, decomposability would be a trivial condition." This is wrong: CNFs are not decomposable.
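The exponential blow-up objection raised above (the BDD semiring) hinges on the gap between the number of semiring operations and their cost. As an aside, the same gap can be seen in an even simpler semiring, chosen here purely for illustration: finite sets of strings, with union as addition and pairwise concatenation as multiplication. A chain of n products uses only n semiring operations, yet the resulting element has 2^n members, so counting operations says nothing about running time.

```python
# Illustration (my own, simpler analogue of the BDD-semiring objection):
# in the semiring of finite string sets, with union as "+" and pairwise
# concatenation as "*", a linear number of semiring operations can build
# an exponentially large element -- so "linear number of operations" does
# not imply linear time.

def set_plus(a, b):
    return a | b                                  # semiring addition: union

def set_times(a, b):
    return {x + y for x in a for y in b}          # pairwise concatenation

def product_chain(n):
    """Multiply n copies of {'0', '1'}: n operations, 2**n-element result."""
    result = {''}                                 # multiplicative identity
    for _ in range(n):
        result = set_times(result, {'0', '1'})
    return result
```

For example, product_chain(10) performs 10 semiring multiplications but returns the set of all 1024 binary strings of length 10; conjoining BDDs behaves analogously, which is the reviewer's point against the linear-time claim.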
Line 681 states that "One important consequence of the sum-product theorem is that decomposability is the sole condition required for an SPN to be tractable; previously, completeness was also required. This expands the range of tractable SPNs." There is a polynomial-time transformation from incomplete to complete SPNs, known for over a decade as the transformation from non-smooth to smooth NNFs or ACs, so dropping this requirement does not create additional tractability.

The logical inference tasks in Section 5.1 do not go beyond SAT solving unless one talks about the determinism property. As is, the sum-product theorem does not subsume other logical reasoning tasks.

Generating the data for learning according to line 808 is intractable, which seems unfair. Is this data-generation time included in the comparison?

=====