We would like to particularly thank the reviewers for their in-depth reviews, constructive criticism and comments. Before addressing each comment individually, we would like to respond to the remark of Reviewer_2 that "the idea is incremental"; we argue that this is not the case, for multiple reasons. First, this is, to the best of our knowledge, the first application of the concept of pseudo-data (PD) in the context of Bayesian neural networks (NNs). Second, this is also one of the first works that goes beyond fully factorized variational posteriors for a NN; with the matrix Gaussian we can straightforwardly introduce correlations among the hidden units and, furthermore, greatly reduce the number of variational parameters through approximations to the covariances. For example, in the 3x750 scenario (Section 4.2) the overall number of variational parameters that we optimize per layer is about 640K, whereas fully factorized posteriors require 1,125K, almost double. Finally, the fact that we work in the primal space, i.e. weight space, allows us to easily scale to large datasets via minibatches. This is in contrast to vanilla deep GPs, where the number of parameters grows linearly with the size of the data, and even to the recently published "Variational Auto-Encoded Deep Gaussian Process" (ICLR 2016), which requires distributed computation of some likelihood terms for scalability.

As far as the non-parametric property is concerned: strictly speaking, yes, the kernels are indeed degenerate (i.e. they have finite rank). In practice this means that each GP in our formulation has support for a finite number of datapoints, with an upper bound of K, where K is the dimensionality of the input. This is ensured in our experiments, as the number of PD, M, is always less than K; therefore, combined with a real datapoint, we obtain a positive definite kernel for the row covariance. However, do note that the form K(x,y) = \sigma(x) U \sigma(y) can be seen as an approximation to a non-parametric kernel if we employ Mercer's theorem (hence the mentions of "non-parametric"); the approximation becomes more accurate as we increase the number of units in each hidden layer (a small numerical illustration of this degenerate kernel is included at the end of this response).

As for where the improvements come from: empirically, both the correlations and the PD provide significant gains. On the one hand, we observed that we can generally obtain a boost in performance by increasing the amount of correlation (i.e. the rank of the row and column covariances), as the model becomes more flexible and can better fit the data. On the other hand, we also observed that the PD significantly reduce the overall variance of the model, thus allowing for "easier" optimization (inference). This effect can be better understood by looking at the lower bound (without the PD) that is employed: it is composed of the expected log-likelihood and the negative KL-divergence between the prior and the posterior (the regularization loss). In the toy experiment (the purpose of which was not to compare against a regular GP, since the underlying function is indeed quite simple, but rather against other Bayesian NN methods), the log-likelihood is computed from only 20 datapoints and, due to the noisy nature of the Monte Carlo integration, does not provide a strong enough learning signal; as a result, the cost is dominated by the KL-divergence.
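To make this concrete, the bound in question has the standard form (written schematically here; the notation is generic rather than the exact one of the paper)

    L(q) = E_{q(W)}[ \log p(D | W) ] - KL( q(W) || p(W) ),

where the first term is estimated with noisy Monte Carlo samples of the weights W and the second term pulls the approximate posterior towards the prior.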
The model therefore pushes the posterior towards the prior, which leads to underfitting and noisy predictive distributions (this effect is visible in Figure 1a). With the PD we can counter this, as they add extra degrees of freedom to the inference model (i.e. they increase its flexibility): if the input is similar to the PD, we obtain a significant variance reduction (Eq. 8), which results in less noise during optimization and in turn makes fitting the data easier (Figure 1d).

As for the other comments: the sizes of the regression datasets can be found in the "Probabilistic Backpropagation" paper (we did not include them due to space constraints in the table), and the Year dataset was treated this way because that is how the instructions of this dataset define the evaluation (more information can be found at the UCI repository). Complexity-wise, a typical variational Bayesian NN has computational complexity O(K^2) per layer, whereas our model has O(K^2 + M^3); since M << K, this does not incur significant extra computational cost. Extending to convolutions seems straightforward, as the convolution operator performs inner products between a matricized kernel and a vectorized patch; we can employ similar machinery and view a convolutional layer as a (finite-rank) GP that is "slid" across the patches of an image (a rough sketch of this view is also included at the end of this response). Finally, there are of course other approaches that treat NNs in a Bayesian way; we were referring to the models currently used in most deep learning applications.
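Below is the promised minimal numerical sketch of the degenerate kernel K(x,y) = \sigma(x) U \sigma(y) for a single layer, written in NumPy. The layer sizes, the ReLU features and the randomly generated positive semi-definite U are illustrative assumptions rather than the exact setup of the paper; the point is simply that the Gram matrix over N > K points has rank at most K.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative sizes (not the ones from the paper):
    N, D, K = 200, 10, 50   # datapoints, raw input dim, feature (layer input) dim

    # Stand-in for the previous layer's activations sigma(x):
    # a fixed random linear map followed by a ReLU.
    W_prev = rng.standard_normal((D, K))
    X = rng.standard_normal((N, D))
    Phi = np.maximum(X @ W_prev, 0.0)          # Phi[i] = sigma(x_i), shape (N, K)

    # A positive semi-definite row covariance U (here simply A A^T plus jitter).
    A = rng.standard_normal((K, K))
    U = A @ A.T + 1e-6 * np.eye(K)

    # Degenerate kernel K(x, y) = sigma(x)^T U sigma(y)  ->  Gram matrix Phi U Phi^T.
    Gram = Phi @ U @ Phi.T                     # shape (N, N)

    # The Gram matrix over N > K points has rank at most K (finite-rank kernel).
    print(np.linalg.matrix_rank(Gram), "<=", K)

Increasing K, i.e. the number of hidden units, is what makes this finite-rank kernel a progressively better approximation to a non-parametric one.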
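And here is a rough sketch of the convolutional view mentioned above, again with illustrative sizes and an im2col-style patch extraction (a sketch of the idea, not code from the paper): each patch is vectorized, the convolution becomes an inner product with the matricized filters, and the same kind of finite-rank kernel can be evaluated between patches.

    import numpy as np

    rng = np.random.default_rng(1)

    H = W = 16                 # illustrative image size
    P = 3                      # patch (filter) size
    C_out = 8                  # number of filters
    image = rng.standard_normal((H, W))

    # im2col: collect every P x P patch as a row vector (valid convolution, stride 1).
    patches = np.array([
        image[i:i + P, j:j + P].ravel()
        for i in range(H - P + 1)
        for j in range(W - P + 1)
    ])                                          # shape (num_patches, P*P)

    # A convolution is an inner product of each vectorized patch with the
    # matricized filters, i.e. one linear layer applied to every patch.
    filters = rng.standard_normal((P * P, C_out))
    conv_out = patches @ filters                # shape (num_patches, C_out)

    # Viewing that linear layer as a finite-rank GP "slid" across patches:
    # the kernel between two patches p, q is p^T U q, with U the row covariance.
    A = rng.standard_normal((P * P, P * P))
    U = A @ A.T + 1e-6 * np.eye(P * P)
    Gram_patches = patches @ U @ patches.T      # rank at most P*P

    print(conv_out.shape, np.linalg.matrix_rank(Gram_patches), "<=", P * P)

Weight sharing would then correspond to using the same (finite-rank) GP at every patch location.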