Paper ID: 793
Title: Structured and Efficient Variational Deep Learning with Matrix Gaussian Posteriors

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors propose a matrix-valued stochastic process built upon multivariate Gaussian processes. The rows represent elements of a mini-batch, whilst the columns are the usual inputs/features of an analogous neural network. The authors then derive a perturbation analysis variational inference scheme. The experiments suggest that this method outperforms several other Bayesian neural network approaches (sometimes by rather a lot).

Clarity - Justification:
The description of the method and experiments is extremely clear. As it stands, however, it is probably impossible to reproduce the experiments, as the hyperparameters are not fully listed (e.g., learning rates).

Significance - Justification:
Deep GPs have been derived before (Damianou & Lawrence, 2015), but without mini-batches and without direct comparison to other Bayesian neural network methods. There are likely several tricky problems to overcome with this approach (using structures commonly exploited in neural networks, such as convolutions, and scaling to larger feature and data set sizes), but these are a source of interesting future directions.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
It's nice that smaller networks perform well compared to larger Bayesian neural networks. Do networks of the same, larger size perform even better? How much of the improvement in performance do you think is due to using deep MGPs, and how much is due to the approximate inference? How would you extend this approach to convolutional networks?

Nitpicking:
  267: "Lets'" -> "Let's"
  287: "matrix variate Gaussian property 5" reads oddly (it is not clear that this refers to equation 5).

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors propose a richer variational approximation for deep Bayesian MLPs that is computationally tractable and has an interpretation as a deep Gaussian process.

Clarity - Justification:
I would have liked a clearer distinction between the model and the inference scheme, especially in the intro. As far as I understand, the model is initially a standard Bayesian MLP. The authors then propose a variational posterior that has covariance between weights; they then use the reparameterization trick and notice that the resulting variational distribution is a projection of a deep GP, made tractable by conditioning on pseudo-data. The pseudo-data are then also given a Bayesian treatment (since we are no longer doing variational inference in a Bayesian MLP, but something like FITC in a deep GP). There is really a lot going on in this paper, which is a good thing, but it was a wild ride, and I think that a little more structure would help. (To check my own understanding, I sketch the core sampling step below.)
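A minimal sketch of that sampling step, assuming a matrix variate Gaussian posterior q(W) = MN(M, U, V) over a layer's weight matrix with row covariance U = A A^T and column covariance V = B B^T; the names M, A, B, the function sample_weight, and the NumPy code are my own illustration, not the authors' implementation:

    import numpy as np

    def sample_weight(M, A, B, rng):
        # W = M + A E B^T with E ~ N(0, I) is distributed as MN(M, A A^T, B B^T),
        # and the sample is a differentiable function of M, A, B (reparameterization trick).
        E = rng.standard_normal(M.shape)
        return M + A @ E @ B.T

    rng = np.random.default_rng(0)
    n_in, n_out = 4, 3
    M = np.zeros((n_in, n_out))                        # variational posterior mean
    A = np.tril(rng.standard_normal((n_in, n_in)))     # factor of the row (input-side) covariance
    B = np.tril(rng.standard_normal((n_out, n_out)))   # factor of the column (output-side) covariance
    W = sample_weight(M, A, B, rng)                    # one posterior draw of the weight matrix

Note that a draw costs only two extra matrix products on top of computing the mean, which is presumably what keeps the richer covariance structure affordable.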
Significance - Justification:
In the narrowest sense, this paper could be viewed as simply a clever way to provide richer covariance structure for variational inference in Bayesian neural nets. But the strong connections between different views of the same object, the reinterpretation as a deep GP, the attention paid to computational aspects, and the exhaustive experimental comparison make this a solid contribution by any measure. I would have liked to see another experiment like Figure 1, but showing other properties of the model in contrast with previous models.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
This paper overall does a great job of connecting to recent prior work. In Section 4.2, it would be nice to report how many variational parameters you were optimizing, since you make an argument about this being important for performance.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors propose variational Bayesian inference in a neural network with a matrix variate Gaussian distribution on the parameters. The authors compare to a number of alternatives on a range of regression and classification tasks.

Clarity - Justification:
The text is reasonably well written, although some comments are unclear. The nonparametric flexibility is emphasized many times, but are the kernels not degenerate? This non-parametric aspect of the model needs to be further elucidated. I am not sure what you mean by: "with the pseudo-data we effectively increase the flexibility of our model, thus allowing the weight posterior to be closer to the prior without severe loss in performance". Why do we want the weight posterior to be closer to the prior, and what does this have to do with flexibility? Where are the advantages of the model really coming from? Correlations? Non-parametric flexibility? The approximate inference method? The writing is loose in places; for example, "While deep learning methods are beating every record in terms of predictive accuracy..." is an exaggeration. There are also a few minor idiosyncrasies in the style; for example, I don't see why "random matrices" is italicized in the abstract. But this is just a subjective preference.

Significance - Justification:
The idea is nice but a little incremental, and it's hard to understand exactly where the performance gains are coming from.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper is fun to read and reasonably well written, and the experiments are relatively comprehensive, although they do not go into particular depth. The toy dataset seems like it would easily be solved with a much simpler model, such as a GP with an SE kernel. In general, I'd be curious how a GP with standard kernels compares on these datasets (a minimal example of the kind of baseline I have in mind is sketched after this review). What are the sizes and numbers of predictor dimensions for each dataset? How does the approach scale? There should be some empirical results for this. The methodology is somewhat incremental, and it is unclear exactly what produces the performance gains. The comments about the nonparametric properties and flexibility of the model are unclear. The claim that deep learning methods do not reliably provide confidence intervals is not entirely correct; for example, there are many Bayesian approaches to neural network estimation, going back to MacKay (1992) and Neal (1996) (PhD thesis). There are also approaches that combine Gaussian processes with deep learning architectures and look at Kronecker products of kernels; see, for example, "Deep Kernel Learning" by Wilson et al. (AISTATS 2016). Why is the year dataset treated differently from the others for validation?

=====
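The baseline referenced in Review #3, sketched minimally with scikit-learn; the toy data, kernel settings, and library choice are illustrative assumptions rather than anything taken from the paper:

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    rng = np.random.default_rng(0)
    X = rng.uniform(-4, 4, size=(50, 1))                   # 1-D toy inputs (illustrative only)
    y = np.sin(X).ravel() + 0.1 * rng.standard_normal(50)  # noisy toy targets

    # SE/RBF kernel plus a learned noise term; hyperparameters are fit by maximizing
    # the marginal likelihood inside GaussianProcessRegressor.fit.
    kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

    X_test = np.linspace(-6, 6, 200).reshape(-1, 1)
    mean, std = gp.predict(X_test, return_std=True)        # predictive mean and uncertainty

Reporting the test log-likelihood of such a baseline alongside the paper's numbers would help separate what the deep, correlated posterior adds beyond a standard single-layer GP.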