Paper ID: 1208
Title: Discriminative Embeddings of Latent Variable Models for Structured Data

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper develops a discriminative approach to learning with structured input data. Unlike in kernel learning with structured data, the structural features can be learned in a deep-learning fashion. Experiments on word recognition and image tagging compare the approach favorably to kernel methods for structured input.

Clarity - Justification:
The approach is motivated by kernel embeddings for distributions on graphical models (not unlike Fisher kernels) and discusses this concept and inference methods for graphical models in great detail. Ultimately, however, graphical models only serve as a very high-level motivation for the approach. Only the graph on the input variables is ever used; the approach does not involve any parameters of any graphical model or any distributions over any variables. Mean-field inference and loopy belief propagation likewise serve only as an extremely high-level motivation for the way in which the algorithm iterates over nodes. In Equation 10, the graphical model is suddenly gone and no longer needed. I find this discrepancy between the motivation and "derivation" on one hand and the actual method on the other quite confusing. The paper also uses bizarre hyphenation rules and has countless spelling errors and grammatical flaws! Why would any language encourage the hyphenation of a word's first or last letter?

Significance - Justification:
The paper addresses an interesting and relevant problem, and the high-level idea of deep learning of the structural properties that a classifier for structured input data uses (in contrast to kernel methods, which implicitly learn in the space of all possible structural elements that the kernel function can distinguish) is appealing. My main issues are (1) the lack of a discussion of Chen et al.
(ICML 2015), (2) the motivation and derivation of the method (see under Clarity), and (3) the lack of an empirical comparison to Fisher kernels.

The paper should cite and discuss Chen et al. (learning deep structured models, ICML 2015). While the problem settings are not identical, they are closely related, and I find it difficult to judge the similarity between the inference procedures.

The paper compares the derived method experimentally to several string and graph kernels, but Fisher kernels are immediately dismissed as a baseline "since they take too much time to run some of our baselines [spectrum kernels] have already shown better results". I am still not convinced that Fisher kernels are not an option. People have trained very large generative models (for instance, in speech recognition), and training the generative model should be the main bottleneck in using Fisher kernels, right? "Too much time" also sounds fairly unspecific, as if waiting for a baseline method to converge is just generally too much trouble. Spectrum kernels can only be applied to one of the problems; also, and I am speculating here, shouldn't there be a Fisher version of spectrum kernels?

The experimental section claims that "we are able to handle millions of data points", but then proceeds to "subsample the training data to make the algorithm put more attention on molecules with higher PCE values". Right. So how many data points did you end up with after subsampling?

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
See above.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents a very scalable approach to supervised learning on structured data (e.g., graphs, trees, sequences). The paper embeds latent variable models into feature spaces, which are learned from the data (and labels) to be optimized for the learning task at hand.
The experimental validation shows that the approach is competitive with state-of-the-art graph kernels while scaling far better to large datasets. The paper is highly relevant and interesting; it reads well and is based on very novel ideas, to the best of my knowledge.

Clarity - Justification:
The paper is well written, interesting, and easy to follow. I am not an expert in graphical models, though, and found Section 4 hard to follow.

Significance - Justification:
Graph data are abundant and important, which gives the paper high potential impact.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Something is going wrong with the division of words at the ends of lines; check this for the final version.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes methods for learning discriminative non-linear embeddings of structured data, performing graphical model inference (mean field and loopy BP) in the embedded representation, and solving prediction tasks using the learned representations and models. Experimental results demonstrate an advance on particular datasets.

Clarity - Justification:
The paper is relatively well written; especially the introductory Sections 1-3 are clear. The methods section could use some further clarification; for example, the exact neural net parameterisation (12) could benefit from some further elucidation.

Significance - Justification:
I feel the paper is a nice mixture of ingredients, combining Hilbert space embeddings with graphical model inference and gradient-based learning. I don't feel it is a major breakthrough, though. I think the paper would be more significant if the connection to the kernel-based work by Song et al. (2009-2011) were more directly assessed. As it stands, the paper mentions those methods but does not provide further discussion or experimentation.

Detailed comments.
(Explain the basis for your ratings while providing constructive feedback.):
- Lines 420-437: Please discuss this neural net formulation more thoroughly. Define the matrices completely (including both dimensions). Also provide some intuition for why that parameterisation was chosen and whether there are other potential parameterisations.
- Check Algorithms 1 and 3: there seem to be some mistakes in the indexing (i vs. j, t vs. n).

=====
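The graphical-model-free node iteration that Reviews #1 and #3 discuss (mean-field-style sweeps that update node embeddings from neighbors' embeddings, with no explicit graphical-model parameters or distributions) can be sketched roughly as follows. This is a minimal illustration only: the weight names W1/W2, the ReLU nonlinearity, the sum pooling, and the fixed iteration count are assumptions, not the paper's exact parameterisation in Equations 10-12.

```python
import numpy as np

def embed_mean_field(node_feats, adj, W1, W2, n_iters=4):
    """Mean-field-style embedding sweeps over the input graph.

    node_feats: (n, d_in) raw node features
    adj:        (n, n) binary adjacency matrix of the input graph
    W1, W2:     weight matrices (hypothetical names) of shape (d, d_in)
                and (d, d); discriminatively trained in the paper's setup
    """
    n = node_feats.shape[0]
    mu = np.zeros((n, W1.shape[0]))        # node embeddings start at zero
    for _ in range(n_iters):               # fixed number of sweeps
        agg = adj @ mu                     # sum the neighbors' embeddings
        mu = np.maximum(0.0, node_feats @ W1.T + agg @ W2.T)  # ReLU update
    return mu.sum(axis=0)                  # pool nodes into a graph embedding
```

Note that only the input graph's adjacency structure enters the computation, which is precisely Review #1's point: no potentials, latent variables, or distributions from a graphical model remain in the actual method.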