Paper ID: 706
Title: Deep Gaussian Processes for Regression using Approximate Expectation Propagation

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper presents a novel approximation for deep Gaussian processes (DGPs) that relies on EP and the FITC sparse approximation. The application of FITC is straightforward, and the really interesting feature of the paper is the way the authors have applied EP. In particular, to sidestep the memory requirements and the double-loop optimisation procedure, the authors use an EP variational approximation with tied factors (similarly to the probabilistic backpropagation method). The second and more novel trick they use inside EP is to approximate the intractable integral Z_n (which requires marginalising the GP latent variables across layers) by a recursive Gaussian projection.

Clarity - Justification:
The paper is very well written and technically sound. It is quite easy to follow once you have a good knowledge of GPs and EP.

Significance - Justification:
Given that DGPs seem to be the currently most promising Bayesian non-parametric model for deep learning, the contribution here is quite significant. This is because the authors propose a completely new inference method for this model, which has good potential.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The idea that really makes the whole paper possible is the recursive marginalisation trick, which deals with a very difficult high-dimensional integral by recursively approximating it with analytic Gaussian moment matching. It is very interesting that the quantities that appear here are similar to the ones appearing in uncertain-input GPs and in variational inference for GP latent variable models. At the same time, I feel this trick needs to be characterised better in terms of its approximation accuracy. For instance, it would be nice if you discussed the cases where you expect this approximation to be accurate and when it is going to fail. I can observe that the approximation can be very accurate when q(u) is very sharply peaked (if it becomes a delta function then the approximation is exact), but if q(u) is very broad the approximation might be very inaccurate. It would be useful to discuss/analyse how learning is affected by this approximation. The experimental comparison is very well organised and extensive. I noticed that you assume a very small hidden dimensionality across the hidden layers. Is the method going to be badly affected in terms of generalisation performance and scalability if you use a much larger hidden-layer dimensionality? It would be useful to comment on that.
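For concreteness, here is how I read the recursive Gaussian projection for a single hidden unit in layer l; the notation (m_l, v_l for the propagated moments, and \mu_l(\cdot), \sigma_l^2(\cdot) for the layer's sparse-GP predictive mean and variance given q(u)) is my own shorthand, not the paper's:

  q(h_{l-1}) = N(h_{l-1} \mid m_{l-1}, v_{l-1}),
  q(h_l) \approx N(h_l \mid m_l, v_l),   with
  m_l = E_{q(h_{l-1})}[ \mu_l(h_{l-1}) ],
  v_l = E_{q(h_{l-1})}[ \sigma_l^2(h_{l-1}) + \mu_l(h_{l-1})^2 ] - m_l^2.

Both expectations are analytic for standard kernels such as the exponentiated quadratic, which is exactly where the resemblance to uncertain-input GPs and the variational GP-LVM comes from. The true marginal of h_l is non-Gaussian; only its first two moments are kept before moving on to the next layer, and this projection is repeated recursively up to the output.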
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper combines many existing approaches to obtain an approximate inference algorithm for deep GPs. The paper also presents an exhaustive comparison of many methods on many datasets.

Clarity - Justification:
The paper is easy to follow, in general. But there are many details, e.g. in the sEP and then the ADF method, that are difficult to follow. It might be good to provide a discussion or a list of all the approximations (or assumptions) made in the algorithm that the reader should be aware of (e.g. the Gaussian approximation, then the sEP approximation, then the ADF approximation, etc.).

Significance - Justification:
One contribution of the paper is in presenting an extensive experimental comparison. However, the results do not seem conclusive (please see the detailed comments). The other contribution is in proposing a new method for approximate inference for deep GPs. But it is not clear if this method is more accurate than existing methods, since there is no comparison to existing methods for deep GPs. Inclusion of such results would improve the significance of the paper, and a discussion on this point would also be useful.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
One of the contributions of the paper is in providing an extensive comparison on more than 10 datasets against many existing approaches for BNNs. However, the presented results are not conclusive. The proposed approach wins on 4 out of 10 datasets (on boston the difference is not significant, on energy HMC is doing well, and on datasets #4 to 7 the proposed method loses to BNNs). RMSE follows the same trends (in the Appendix). The claims made in the abstract are not supported by these experiments, i.e. the proposed method is not "almost always" better than BNN methods. Another issue is that there is no comparison to any existing approach for deep GPs (e.g. Damianou and Lawrence 2013, at least for the small datasets). Since the first contribution of the paper is a new method for deep GPs, it is necessary to see why an approach based on EP might beat the existing approaches. After reading the paper, I am not clear why this is the case. Instead of comparing against many flavors of algorithms for BNNs, I would rather see at least one existing method for deep GPs. A final issue is about the new method itself. The motivation of the new method appears to be to reduce the memory cost of the existing method, which scales linearly with N. This is in good spirit, but it is not clear to me, after reading the paper, what the nature of the approximations involved in the proposed algorithm is and how they compare to existing methods. Is your method more or less accurate than existing methods? Or is it so different that it is difficult to comment on this? I assume it is the latter. In that case, experimental comparisons with existing methods become even more important.
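To make the memory point explicit, here is my reading of the tied-factor (stochastic EP) construction that the method builds on; the notation is mine and this is a sketch of the general idea rather than of the paper's exact derivation. Standard EP over the inducing outputs u of a sparse GP layer keeps one approximate site per data point,

  q(u) \propto p(u) \prod_{n=1}^{N} \tilde{f}_n(u),

so the parameters of N factors have to be stored and refined, and memory grows linearly with N. The tied-factor approximation instead uses N copies of a single average site,

  q(u) \propto p(u) \tilde{f}(u)^N,

so only one factor per layer is stored and updated. Whether this tying, combined with the ADF/moment-matching steps, costs accuracy relative to full EP or to the variational deep GP approaches is exactly the question that a direct empirical comparison would answer.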
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper presents a new approach to training deep Gaussian processes by combining three previous approximation techniques: FITC, SEP energy minimization and ADF for probabilistic backpropagation. The authors test the new framework extensively on standard benchmark sets.

Clarity - Justification:
The manuscript does a good job of describing the existing literature and the approaches it is based on. A point of concern is the mathematical notation: the authors switch representation freely depending on local requirements, which makes it hard to read for first-time readers (e.g., switching between the \alpha and \theta representations of parameters without warning). Numbering equations and using references might be one step towards guiding the reader through the derivations. Personally, I would have preferred different relative weights in presenting the background material: given that SEP and PBP play such a major role in the derivation but are also relatively new, a longer discussion of these two topics might further improve readability, even if this would require dropping some details of the other building blocks.

Significance - Justification:
While the authors don't claim novelty for each individual ingredient of the framework, the combination of all factors to arrive at a working algorithm seems non-obvious to me, which I judge to be significant. However, I have to note that I am not an expert on any of the three key ingredients (FITC, EP and PBP), so this evaluation needs to be taken with a grain of salt. The good presentation, together with the extensive tests and a promise to release the source code, might make this work an important point of reference for future work in the same area.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Overall, I found this to be a very solid piece of ML research. The combination of existing ideas leads to a framework which makes an existing class of algorithms feasible for a broader class of data sets. In that vein, this work reminded me of the early approaches in deep neural nets with pre-training, dropout, etc. However, where the DNN ideas had been novel on their own at the time, this work is a novel combination of existing work. The theoretical justifications for the proposed method are mainly driven by practicality, but they are convincingly presented. If a reader is informed about the relative strengths and weaknesses of other approximate Bayesian inference techniques, I assume she should also be able to judge and weigh this approach accordingly. While there are some modelling restrictions on the kernel (l. 512) and the likelihood function (supplements), they seem to be very mild, such that I consider this method to be widely applicable. A potentially double-edged sword is the large number of possible configurations, which could make the model powerful but also hard to use in practice. Since this is a bane that is also typical of other deep models, I consider this issue to be neutral for my final rating. However, the paper could be improved in this regard by comparing against the original deep GP approximations of Damianou & Lawrence (2013) and Hensman & Lawrence (2014), even if they do not constitute the closest runners-up. While the manuscript explains that these approaches have not been evaluated on medium or large datasets (l. 230), the presented algorithm purportedly shares most of their complexity (ll. 262). Thus, it would be easier to evaluate its relative merits by comparing it directly against the methods the new approach is based on. If I had to improve the manuscript, I would focus on finding a mathematical representation that would ideally be more precise and easier to follow. I had the impression that the authors have put a lot of emphasis on describing the intuition behind the approximations, which might be a bit over-complete. This could be shortened to make room for more mathematical precision. However, this is clearly a personal preference that need not represent the whole community.

=====