We would like to thank the reviewers for their hard work and insightful comments on the paper. We begin by recapitulating the main contributions of the paper, which in our view are two-fold.

First, we have presented a novel combination of recently developed techniques that enables large-scale approximate inference in deep Gaussian Processes and achieves state-of-the-art performance for regression. One particular novelty is the realisation that using the EP energy function in combination with the FITC approximation enables probabilistic backpropagation to be deployed to approximate the intractable moments (the standard moment-matching identities involved are recalled below, after our response to Reviewer 3). Two of the reviewers identified this novelty: R4 "the authors propose a completely new inference method for this model which has good potential" and R1 "the combination of all factors to arrive at a working algorithm seems non-obvious to me which I judge to be significant".

Second, the paper presents the most comprehensive comparison of approximate inference in Bayesian Neural Networks conducted to date. There has been a great deal of recent work in this area, so this comparison is likely to be broadly useful. All the reviewers agreed: "...together with the extensive tests...", "the experimental results are convincing...", and "The experimental comparison is very well organised and extensive".

Finally, we agree with R4 that Deep Gaussian Processes are one of the "most promising Bayesian non-parametric models for deep learning", and so we expect this paper, which shows for the first time that they can achieve state-of-the-art results, to have a significant impact.

We now address more specific comments.

Reviewer 1
Q1: switching between \alpha and \theta representation of parameters without warning
A1: Thank you for pointing this out; we will clarify the notation. We note that \alpha is defined on line 326 as the set of hyper-parameters and inducing points and is used consistently throughout. It was our mistake to denote the natural parameters by \theta, having previously used \theta for the hyper-parameters.

Reviewer 3
Q2: "The method used are more or less combination of existing approaches."
A2: Please see the general response above for why this combination of methods is itself novel and non-obvious.
Q3: "The methods used themselves are not theoretically sound and it is not clear how easy or difficult it is to make them work in practice."
A3: The combination of methods proposed in this paper is certainly daring, with aggressive approximations made at each stage. We arrived at it after considering many other approaches, including variational approaches (with and without recognition models) and expectation propagation (mentioned on lines 410-412 and in the supplementary material), but the proposed method performed best by some margin and is the most scalable. Arguably it is as sound as modern applications of variational methods that use recognition models and Monte Carlo to enable inference in deep probabilistic models, which also have little in the way of theoretical guarantees.
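For the reviewers' reference, the Gaussian projection at the heart of the probabilistic backpropagation step uses the standard EP moment-matching identities; we restate these textbook results here for concreteness rather than claiming anything new. Writing the current Gaussian approximation as \mathcal{N}(x; m, v) and the intractable factor being incorporated as f(x), define

Z(m, v) = \int f(x) \, \mathcal{N}(x; m, v) \, dx .

The Gaussian matching the first two moments of the tilted distribution f(x) \mathcal{N}(x; m, v) / Z is then

m^* = m + v \, \frac{\partial \log Z}{\partial m}, \qquad
v^* = v - v^2 \left[ \left( \frac{\partial \log Z}{\partial m} \right)^2 - 2 \, \frac{\partial \log Z}{\partial v} \right] .

For a deep GP, \log Z is itself intractable, and probabilistic backpropagation approximates it by propagating Gaussian moments through the layers under the FITC approximation; this is the step whose accuracy Reviewer 4 asks about below.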
Reviewer 4
Q4: "feel this trick needs to be characterised better in terms of its approximation accuracy."
A4: We agree. We know that (a) if the model is linear (linear covariance functions), the propagation is exact; (b) for non-linear models, as the reviewer insightfully points out, a highly confident q(u) leads to accurate propagation, whereas an uncertain q(u) leads to larger inaccuracies, but the approximation will be under-confident in this case, since the variance of the projected Gaussian is larger than the width of any single mode (see the identity at the end of this response). This is arguably a desirable property: when the model is certain about what is going on, approximate inference is accurate; when the model is uncertain, approximate inference is inaccurate but under-confident (and the unapproximated prediction would also be uncertain in this case, so not much is lost).
Q5: small hidden dimensionality across the hidden layers and its effect on generalisation performance and scalability
A5: We agree that deep GPs with high-dimensional hidden layers would be very powerful. The experiments showed that increasing the dimensionality resulted in significant improvements in accuracy (see Figures 4 and 5), and we will investigate increasing it further. Promisingly, the complexity of the inference algorithm scales only linearly in this quantity. We expect, however, that the quality of the approximation will degrade as the hidden-layer dimensionality increases, since the posterior will become less concentrated and the probabilistic backpropagation step will therefore incur larger errors. Finally, we note that the hidden-layer dimensionality should not be conflated with the size of the hidden layers in a neural network (the equivalent neural network here has an infinite number of neurons in each hidden layer).
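To make the under-confidence point in A4 concrete, the following standard identity is the one we have in mind (a mixture of Gaussians is used purely as a convenient stand-in for the multi-modal distributions that can arise when an uncertain q(u) is pushed through a non-linear layer). If

p(y) = \sum_i w_i \, \mathcal{N}(y; \mu_i, \sigma_i^2), \qquad \bar{\mu} = \sum_i w_i \mu_i ,

then the moment-matched Gaussian has

\mathbb{E}[y] = \bar{\mu}, \qquad
\mathrm{Var}[y] = \sum_i w_i \sigma_i^2 + \sum_i w_i (\mu_i - \bar{\mu})^2 .

The second, between-mode term means the projected variance is never smaller than the average within-mode variance and grows as the modes separate, so the Gaussian projection errs on the side of being too broad rather than over-confident.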