Paper ID: 891
Title: Nonlinear Statistical Learning with Truncated Gaussian Graphical Models

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): This paper
- proposes to use latent truncated Gaussian variables to tackle many machine learning problems;
- shows how to perform efficient inference (through VB) and sampling in these models;
- shows that a deep network of truncated Gaussian variables smoothly approximates a NN with ReLUs;
- shows that it has better generalization properties than VB.
The amount and novelty of the contributions seems sufficient for acceptance to ICML.

Clarity - Justification: The paper is well written and mathematically sound: the notation is consistent throughout, although quite dense for a reader not familiar with variational inference. The description of the experiments is clear, and they seem reproducible.

Significance - Justification: While truncated Gaussian variables have been known to Bayesian practitioners for a long time, especially through probit regression, the extension from probit to generic non-linear architectures was not obvious, and this paper opens the door to a new family of models that could be used in many different settings. Some details about the complexity of the algorithm would have been welcome.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): Overall, the paper is a good contribution to the machine learning literature. However, it is not clear to me why practitioners would use this model. Is there any reason to believe that it is always better than a standard ReLU network? It is shown on several toy datasets that it gives better performance, but it is not clear why this should hold in general. Is it because of the implicit smoothing compared to the ReLU, as explained in Section 2, or is there another reason for the data to be more naturally approximated by a TGGM?

Also, a generic framework is presented for graphical models, but the experiments focus on a single task, supervised prediction. Why should this model also be good for unsupervised data? Is it a universal approximator? What types of interaction can it not handle?

It is also not clear how much slower ML is compared to backprop (it should be slower, if I am not mistaken, since there are variational parameters to tune, which is an inner optimization problem).

The ratio \phi(a)/\Phi(a) is well studied and is closely related to the survival function of the Gaussian. Its asymptotic expansion is linear, so I would suggest that the authors look more closely at this problem, as it is easy to come up with a fast computation of the integral. I am surprised that the piecewise-constant lookup gave good results, as one usually prefers differentiable approximations.
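To make the suggestion concrete, here is a minimal sketch of the kind of computation I have in mind (my own illustration, not the authors' code): the ratio \phi(a)/\Phi(a) is the inverse Mills ratio, it can be evaluated stably in log space, and it also gives the truncated-Gaussian mean that acts as a smoothed ReLU.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import log_ndtr  # numerically stable log Phi(a)

def inverse_mills_ratio(a):
    """phi(a) / Phi(a), computed in log space to avoid underflow.

    For a -> -infinity this ratio behaves like -a (asymptotically linear),
    which is exactly the regime where naive division breaks down.
    """
    return np.exp(norm.logpdf(a) - log_ndtr(a))

def truncated_gaussian_mean(mu, sigma=1.0):
    """E[h | h >= 0] for h ~ N(mu, sigma^2): a smoothed version of ReLU(mu)."""
    return mu + sigma * inverse_mills_ratio(mu / sigma)

# Interpolates smoothly between 0 (for mu << 0) and mu (for mu >> 0).
print(truncated_gaussian_mean(np.array([-10.0, -1.0, 0.0, 1.0, 10.0])))
```

A cheap rational or piecewise-linear fit of this ratio would likely be both fast and differentiable, which is why the piecewise-constant lookup surprises me.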
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): This submission introduces a novel class of hierarchical statistical models based on truncated Gaussian distributions. The truncated Gaussian distribution functions as a building block, like a probabilistic version of the ReLU and leaky ReLU. For inference, there are some computational benefits to having a model based on Gaussian distributions.

Clarity - Justification: The writing is clear. The presentation is straightforward. A pleasure to read.

Significance - Justification: This is a really interesting use of truncated Gaussian distributions. Finding better ways to represent uncertainty in neural networks is an important line of work. The experimental results aren't comprehensive, but that's OK.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): Line 18 -- readers may not know what "expected relations" means. The definition around line 95 is clear.

Line 113 -- Sampling-based inference can be pretty fast. I'd like to see a comparison of how much faster the proposed algorithm runs than sampling-based inference (e.g., SGVB, from Kingma & Welling, 2013).

I don't see the BP vs. ML comparison as being so interesting, because BP is neither justified by the model nor something other people are doing, since this model is new. But it doesn't detract from the presentation much either, and some readers may like thinking about the inference procedure in algorithmic terms.

It might be interesting to see how the TGGM performs on synthetic data that's sampled according to your model, to get a feel for an upper bound on how much the TGGM might improve over existing models. It would also be a nice way to stress to the reader that the TGGM is a probabilistic model of the data, and that it works best when its modeling assumptions are satisfied.
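As a concrete version of this suggestion, the generator I have in mind would look roughly like the sketch below (my own illustration, assuming a single hidden layer whose units are Gaussians with mean W0 x + b0 truncated to [0, inf) and a Gaussian output layer; the paper's exact parameterization may differ).

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(0)
n, d_in, d_hid, d_out = 1000, 5, 20, 1
W0 = rng.normal(scale=0.5, size=(d_hid, d_in)); b0 = rng.normal(size=d_hid)
W1 = rng.normal(scale=0.5, size=(d_out, d_hid)); b1 = rng.normal(size=d_out)
sigma_h, sigma_y = 1.0, 0.1

X = rng.normal(size=(n, d_in))
mu_h = X @ W0.T + b0                      # hidden-layer pre-activation means
# Hidden units: Gaussians truncated to [0, inf) -- a stochastic ReLU.
alpha = (0.0 - mu_h) / sigma_h            # lower truncation point in standard units
H = truncnorm.rvs(alpha, np.inf, loc=mu_h, scale=sigma_h,
                  size=mu_h.shape, random_state=rng)
# Output layer: ordinary Gaussian given the hidden units.
Y = H @ W1.T + b1 + sigma_y * rng.normal(size=(n, d_out))
```

Fitting both the TGGM and a standard ReLU network to (X, Y) generated this way would show how large the gap can be when the TGGM's assumptions hold exactly.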
How good are the uncertainty estimates when the modeling assumptions are violated? TGGM-ML gets the best scores of the methods it's compared to, but there are many methods in the literature that get higher accuracy, at least on MNIST, than all the methods reported on. It would be nice to see either that 1) some state-of-the-art models can be improved by modeling uncertainty with TGGM-ML, or that 2) the uncertainty estimates are useful for some real application and better than the kind of UQ estimates obtained from straightforward methods like dropout. That could be future work, though; what's already here is a really nice submission.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): This paper proposes using truncated latent variables in a Gaussian graphical model. The simple rectification non-linearity has the advantage of making the conditional dependencies non-linear, as opposed to the standard Gaussian model. The authors then use this model for both regression and classification, and show how it could be extended to deeper architectures as well.

Clarity - Justification: Some of the details of the methods should be relegated to supplementary materials, and the main manuscript should focus on the higher-level ideas.

Significance - Justification: The idea of using rectification in a graphical model, and comparing this model to ReLU feed-forward units, is interesting.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): The paper proposes a generalized EM algorithm and compares it to a backpropagation learning rule with ReLU units that does not maximize likelihood. The authors show good, but not particularly impressive, experimental results. It is not clear that the added complexity of training this model would scale well to larger networks.

The use of rectified units in a Gaussian graphical model is also not novel -- the authors should be made aware of previous work on nonnegative Boltzmann machines, which use rectified units with a quadratic energy function. Those previous models showed interesting multi-modal behavior that the positive-definite couplings in this paper do not display. It would be interesting to see whether making the weight matrices non-positive-definite would affect performance on the tasks proposed in this work.

=====