We thank the reviewers for their valuable comments. Several reviewers have raised similar concerns, which we address below.

Q1: SIGNIFICANCE We believe our new method is highly significant for automatic approximate inference. Our method BB-\alpha is an alternative to power EP that is convergent, scalable and applicable to very complex models. Convergent implementations of power EP are not used in practice because they require solving a max-min optimization problem, which is harder than a pure maximization problem and requires a slow double-loop algorithm. Power EP also requires computing the expected moments of tilted distributions analytically, which is not possible in complex probabilistic models. Furthermore, the existence of stationary points of the power EP energy is difficult to prove. We introduced the idea of tying factors in the power EP energy to address these concerns. Our constrained version of the power EP energy 1) is lower-bounded for most useful alpha values (see lines 410-427); 2) can be optimized directly (no slow double loop); 3) enables the use of stochastic optimization and black-box inference to handle complex models and large data sets. While the idea of tying factors was used before in "stochastic EP" (SEP), SEP works by matching moments and does not directly optimize an energy function. This prevents SEP from being applied in a black-box manner via stochastic optimization, unlike our method. Our black-box approach BB-\alpha is very easy to implement. This is very appealing for practitioners -- consider VB as an example. VB only requires the stochastic optimization of the VB lower bound, which has led to it becoming the most widely used method for deterministic approximate inference. BB-\alpha is as easy to implement as VB -- in fact VB is a limiting case of the proposed framework. Because of its ease of implementation and generality, we expect BB-\alpha to be a strong alternative to VB. Allowing the use of different alpha-divergences is another advantage of BB-\alpha (contrary to R1's comment that it is "disheartening that alpha has such a significant effect"). By changing \alpha, we interpolate between solutions that locally fit a single mode and solutions that cover the whole posterior (see the divergence family written out after Q2 below). Most past work focused on \alpha -> 0 (VB) or \alpha = 1 (EP), but in many settings these \alpha values may not be optimal, so gains can come from using other values of \alpha. This is supported by our experiments, which extensively compare different settings of \alpha. In summary, BB-\alpha is not incremental: it has significant advantages over existing techniques.

Q2: CLARITY Two expert reviewers rated the clarity of the paper as "Excellent" and "Above Average". However, we will still improve the presentation. R3 suggested starting the presentation directly from eq. 9. We agree that the power EP energy might be misleading for non-experts and will improve this in the revision. However: 1) the connection to alpha-divergence minimization is revealed by the fixed-point conditions of power EP, and 2) our method minimizes alpha-divergences. We therefore believe that introducing these ideas through power EP aids the exposition. The reviewers also recommended walking through several toy examples. We will add more toy examples to illustrate the behavior of the method. Note, however, that there is already one toy example in the supplementary material. In this case BB-\alpha returns mean estimates similar to VB's, but does not under-estimate uncertainty as much.
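For reference (this is Minka's standard parameterization of the alpha-divergence, restated here as a reminder rather than as a new result), the divergence family referred to in Q1 is

D_\alpha(p || q) = \frac{1}{\alpha(1-\alpha)} \left( 1 - \int p(\theta)^{\alpha} q(\theta)^{1-\alpha} \, d\theta \right),

which recovers the VB objective KL(q || p) in the limit \alpha -> 0 and the EP objective KL(p || q) at \alpha = 1; intermediate values of \alpha interpolate between these two behaviors.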
Q3: EXPERIMENTAL ISSUES R1 commented that neural network classification is the "only model where the algorithm is tested". This is incorrect -- we consider probit regression, neural network regression and neural network multi-class classification models. Neural networks are examples of very complicated models with wide applicability, where the true posteriors are highly multi-modal and inference is particularly challenging. So, although our proposed method is more generally applicable, in our view it is unfair to criticize the paper on this basis. R1 and R2 asked about the bias in the gradient introduced by the MC approximation. This worried us too, but as shown in the experimental results (table 4), the bias is quite small even with moderate numbers of Monte Carlo samples, and it is dominated by the Monte Carlo variance (table 5). The reviewers also suggested adding comparisons with other approaches (VB and EP), including non-Bayesian ones. We already compare with VB: we show that our method with \alpha -> 0 produces the same results as VB, and that improvements can be achieved by using different alpha values. We will add comparisons with EP in tractable models (e.g., probit linear regression); in intractable models it is not possible to compare with EP. We believe that comparing with non-Bayesian techniques is outside the scope of the paper, since our focus is on accurate automatic approximate inference.
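To clarify why the MC bias behaves this way (a standard argument, not specific to our implementation): the BB-\alpha energy contains terms of the form log E_q[h(\theta)], which we estimate by taking the log of an empirical average over K Monte Carlo samples. By Jensen's inequality this estimate is biased downwards, and a second-order Taylor (delta-method) expansion gives, to leading order,

bias \approx - \frac{Var_q[h(\theta)]}{2 K \, E_q[h(\theta)]^2},

so the bias decays as O(1/K) while the standard deviation of the estimate decays only as O(1/\sqrt{K}). This is consistent with the observation above that the Monte Carlo variance dominates the bias for moderate K (tables 4 and 5).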