Paper ID: 719
Title: Black-Box Alpha Divergence Minimization

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper introduces a tied-parameter approximation to VB/EP (and alpha-divergence variants) and shows how to optimise the resulting objective with stochastic gradient descent.

Clarity - Justification:
The paper is quite easy to follow for an expert. Figures and tables are nicely done. For minor typos see below.

Significance - Justification:
The paper presents a mature, scalable version of EP/VB but makes only "internal" comparisons between different values of alpha. There has been recent work on distributed versions of EP, cited in the paper, so the authors could have compared to those.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
This is quite a nice paper, but it could have compared to methods other than the authors' own. A summary of the major contributions:
1. A tied-parametrisation alpha-divergence approximation for interpolating between VB and EP.
2. The authors prove that the tied objective is bounded from below.
3. They claim that asymptotically the tied approach will be equivalent to the general approach. This can probably quite straightforwardly be formalised more than in the paper. (This requires homogeneity of the data, which is rarely the case in practice. But that is another matter.)
4. Optimisation of the tied-factor bound with stochastic gradient descent.

Minor comments:
1. "Compared to VB, EP has received less attention in the machine learning community." Yes, but still a lot. :-)
2. VI -> VB

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors present a black-box approach to performing alpha-divergence minimization. This combines the idea of parameter tying from stochastic EP with the black-box variational inference literature. While the gradients obtained are biased, it is argued that with sufficiently many samples this bias is small. Results are presented on probit regression and neural network models, where the optimal choice of alpha is found to be both model and dataset dependent.

Clarity - Justification:
The paper is clearly written and the necessary background is covered well.

Significance - Justification:
While the algorithm seems reasonable, it also feels like a somewhat incremental combination of existing ideas.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
While I like the overall ideas in the paper, there are a few issues. Firstly, there is really only one model on which the algorithm is tested: neural network classification (assuming we take probit regression as a special case), which tells me nothing about how this method would do in other, potentially much more multi-modal settings such as latent variable models. Secondly, it is a little disheartening that the optimal choice of alpha a) has such a significant effect on performance and b) is dataset dependent. Thirdly, the use of biased gradients means that it is not clear what is really being optimized. Finally, as a minor point, Salimans and Knowles 2013 was the first paper to do "black box" VB.

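To make the third point concrete, here is a minimal sketch (my own illustration, not the authors' code) of the kind of K-sample Monte Carlo estimate being discussed. It uses a toy Gaussian factor in place of the real likelihood and drops the tied-site and prior/normaliser terms of the energy, so it only shows where the bias enters: the log of a Monte Carlo average is a biased (though consistent) estimator of the log-expectation, the bias shrinks as K grows, and as alpha -> 0 the term reduces to the usual VB expected log-likelihood.

# Minimal sketch, not the authors' code: a K-sample estimate of one
# per-datapoint term of an alpha-divergence style objective,
#   -(1/alpha) * log E_q[ p(y | theta)^alpha ],
# with a toy Gaussian likelihood and q = N(mu_q, sigma_q^2).
import numpy as np

rng = np.random.default_rng(0)

def log_lik(theta, y):
    # toy factor: y | theta ~ N(theta, 1)
    return -0.5 * (y - theta) ** 2 - 0.5 * np.log(2.0 * np.pi)

def alpha_term(y, mu_q, sigma_q, alpha, K):
    eps = rng.standard_normal(K)
    theta = mu_q + sigma_q * eps           # theta_k ~ q (reparameterised)
    log_w = alpha * log_lik(theta, y)      # alpha-scaled log factors
    # log of the Monte Carlo average: a biased estimate of log E_q[...],
    # so the resulting gradient is biased too; the bias vanishes as K grows.
    log_mean = np.logaddexp.reduce(log_w) - np.log(K)
    return -log_mean / alpha

# The average estimate (and hence the objective being optimised) depends on K:
for K in (1, 10, 100, 10000):
    est = np.mean([alpha_term(y=0.5, mu_q=0.0, sigma_q=1.0, alpha=0.5, K=K)
                   for _ in range(500)])
    print(f"K={K:6d}  mean estimate={est:.4f}")
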
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes a new and interesting method for approximate inference that explicitly minimizes a "cavity-constrained" version of the power-EP energy function. However, the paper is hard to follow, and it does not address the right questions, either in theory or in the experiments.

Clarity - Justification:
The paper spends several pages explaining power-EP, but the final method is very different from a local message-passing algorithm, which is confusing for the reader. The theoretical results are not precise, and the experiments are not really helpful in understanding the method.

Significance - Justification:
The idea of minimizing the energy function (6) is great, but the paper does not address the important questions:
- Why is minimizing this energy function always good? It is not exactly the same as EP (it is a constrained version), so why is it good?
- The experiments are only focused on showing that the best performance is obtained for \alpha different from 0, so that the method is better than VB, but this is not surprising. What would be interesting is to have more toy experiments where we can visually see the posterior and its approximation, and compare multiple methods, including VB and EP. Also, for the large-scale experiments, it would be good to compare against alternative non-Bayesian approaches, such as the standard methods for training neural networks, or the MAP estimator for the probit model.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper proposes a variant of the power-EP algorithm that constrains all the cavity distributions of EP to be the same. The main questions:
- There is no discussion of the impact of the constraint that is introduced in this paper. Why does removing this constraint have no impact on performance? In which pathological cases does this constraint fail to find the posterior while EP can? If such pathological cases do not exist, what are the benefits of this energy function? In which cases is the minimum of this energy function equal to the true posterior? None of these questions is clearly addressed in the paper, and they would be worth discussing.

Overall, I think I finally understood the method, but only after 2 or 3 readings at different times of the day. My general comments are the following: I do not think the paper, as it is right now, is worth publishing, but I think the method has some interest, and I would definitely support it if it were presented differently and with experiments that really help the reader understand the quality of the approximation. My recommendation would be to write a new paper about a novel approximate inference technique, because starting with power-EP is not helping the reader. Here is a possible outline of a paper I would love to read:
1- Write equation (9) at the beginning of the paper and illustrate (or prove) why its minimum is close to the posterior distribution.
2- Explain precisely the minimization algorithm that is run, even if this is a "simplified" version that uses simple MC estimates: write down explicitly the algorithm that is used (including the sampling and averaging of the K gradients within the log-expectation terms).
3- Illustrate how it works on a toy example where the factors are written down explicitly (in the current paper, we have to guess the factors for the probit regression; my guess at what they are is written out after this list).
4- Evaluate numerically the performance of the various methods (at least VB and power-EP).
5- Make the connection to EP and SEP in a dedicated section (but at the end of the paper, to avoid having to introduce too much notation beforehand).

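For instance, for the probit regression experiment I assume the factors are the standard ones (this is my reading; the paper should state it explicitly): the target is

  p(w | y, X) \propto p_0(w) \prod_{n=1}^{N} \Phi(y_n w^T x_n),  with y_n \in \{-1, +1\} and p_0(w) a Gaussian prior,

and the tied approximation replaces every likelihood site by a single shared Gaussian factor f(w), so that q(w) \propto p_0(w) f(w)^N and all cavities take the form q(w) f(w)^{-\alpha} (up to normalisation). Writing this down once would remove the guesswork.
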
Minor comments:
- line 404: "unlike SEP, the new method explicitly define an energy function". The energy function that is
- line 469: the authors claim to "jointly optimize the approximate posterior q and the prior distribution". How can we optimize the prior distribution? A prior distribution is by definition a known quantity.
- line 431: "very concentrated" does not help in being more precise...
- lines 526-532: how do you choose K in practice? How do we know that K is too small when we run experiments?

=====