Paper ID: 429
Title: Analysis of Variational Bayesian Factorizations for Sparse and Low-Rank Estimation

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents a theoretical analysis comparing two types of Bayesian models with different variational-Bayesian inference methods. The analysis derives some connections regarding the sparsity of the two models, and establishes when they might give results similar to each other.

Clarity - Justification:
The paper is not written with clear motivation and conclusions. The abstract does not clearly state the results, and it is very difficult to understand the conclusions of the paper and their practical implications. Clearly stating the results in the abstract and then discussing why these results are useful will greatly improve the quality of the paper.

Significance - Justification:
The significance of the paper is not clear to me (and the writing does not help either). The results seem to be somewhat interesting, but it is not clear why they are useful.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
It is obvious that low-rank models are not going to be sparse in general. It is interesting that they overlap when the prior is highly informative, but it is not clear to me why this is useful. I was not excited by the other results presented in the paper; e.g., Proposition 2.1 is interesting, but the theoretical insights presented in this paper are not exciting.

Here is a question that I have wondered about: there are two ways to get a parsimonious model, namely low-rank models (a'*b) and sparse models (the design matrix is sparse). Given some data characteristics, can we say something about which model to choose? I believe there are interesting questions that could be answered using the proposed approach, but the ones presented in this paper are not exciting for me. Please clarify if I am missing something here. Thanks!
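To make this question concrete, the two parsimonious parameterizations being contrasted are usually written as follows (generic notation, not taken from the paper under review):

  sparse coefficients:  y = \Phi x + \varepsilon,  with most entries of x exactly zero;
  low rank:             Y = A B^\top + E,  with A \in \mathbb{R}^{m \times r}, \; B \in \mathbb{R}^{n \times r}, \; r \ll \min(m, n).

In the first case parsimony is tied to the coordinates of the fixed design \Phi, while in the second it lives in an estimated low-dimensional subspace, which is one way to see why the two need not coincide outside special regimes such as the highly informative prior case mentioned above.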
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper theoretically compares two sparsity-inducing priors, the Gaussian scale mixture (GSM) and the bilinear Gaussian (BG), under variational Bayesian inference. The contributions are summarized as Propositions 2.1 to 2.5 for the sparse vector estimation setting, and Proposition 3.1 for the matrix recovery setting. Proposition 2.1 states that the objective function for VB-GSM is similar to its empirical Bayesian counterpart (Wipf et al., 2011). Proposition 2.2 states that the objective function for VB-BG is different from that for VB-GSM, and depends on phi. Proposition 2.3 states that with a diagonal SigmaB (or if phi'phi is diagonal), VB-BG is similar to VB-GSM. Proposition 2.4 formally states that VB-BG is not invariant to coordinate change. Proposition 2.5 states that in the strong prior limit the solution is similar to the one for L1 regularization. Proposition 3.1 states something non-intuitive for the matrix recovery setting. The main statement of the authors is that VB-BG gives a less sparse solution when phi is correlated (discussion after Proposition 2.4), which is supported by numerical experiments.

Clarity - Justification:
Clarity is fine.

Significance - Justification:
The statements of the propositions are nothing surprising. Also, technically, the propositions seem to be just collections of the free energy and stationary conditions for the two sparse estimation setups, or slight modifications of them.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
My main concern is with the novelty/significance of the propositions. The authors state that Proposition 2.1 supports GSM since what it does is the same as type II ML, whose good performance has been proven in Wipf et al. (2011). But Proposition 2.1 simply says that the Bayesian treatment of gamma in GSM is not important, and that VB estimation and MAP estimation for gamma give similar solutions. I think this is what many researchers who have implemented VB for GSM know. Also, people generally know that when the MAP estimator behaves reasonably, changing the inference to VB usually does not have a significant impact.

I see Propositions 2.2 to 2.4 as the most interesting results in this paper. But what Proposition 2.2 states is a natural consequence of the discussion in Nakajima et al. (2013a), where the variable transformation between the BG and the GSM models is discussed. Proposition 2.4 sounds useful because it says that BG gives a non-sparse solution when phi is correlated. But according to Proposition 2.3, you can avoid this disadvantage by forcing further independence of b so that SigmaB is diagonal. Does this really show the advantage of GSM? It is well known that an additional degree of freedom can break sparsity if it can change the coordinate system (e.g., estimating the design matrix or dictionary can break L1 sparsity but induces low-rankness).

About the technical contribution, I see that most of the derivations follow the standard procedure for deriving VB algorithms and the objective function. This is OK if the findings are significant enough. But if the authors faced some difficulties in obtaining any of the propositions, they could emphasize them, so that the reviewer could take them into account as technical contributions.

Minor comments:
- Proposition 2.5, stating that in the strong prior limit BG behaves as LASSO, is trivial or already discussed in the literature, since the prior is equivalent (see the identity recalled below).
- I don't understand the last paragraph of Section 1, where the authors state that the theory for the special cases gives some intuition for general cases. Please explain why you think so.
- Typo in the second line of Section 2.1: dx -> dgamma.
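For reference, the identity behind the remark that the prior is equivalent in this limit is the standard variational characterization of the absolute value (a well-known fact, stated here in generic notation rather than the paper's):

  \min_{a, b \,:\, ab = c} \tfrac{1}{2}\left(a^2 + b^2\right) = |c|,   attained at |a| = |b| = \sqrt{|c|},

with the matrix analogue (for A, B having at least rank(C) columns)

  \min_{A, B \,:\, AB^\top = C} \tfrac{1}{2}\left(\|A\|_F^2 + \|B\|_F^2\right) = \|C\|_*.

So once the factorization is optimized out, an isotropic Gaussian penalty on the factors acts as an L1 penalty on the product (a nuclear-norm penalty in the matrix case), which is the sense in which the BG prior behaves like LASSO when it dominates the likelihood.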
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
VB is a popular algorithm to approximate the posterior for Bayesian analysis in (high-dimensional, sparse) linear regression and low-rank matrix completion/regression. However, different VB approximations are available in these cases, the most popular being the Gaussian scale mixture (GSM) and the bilinear Gaussian factorization (BG). The authors write out explicitly the minimization problems associated with these approximations and perform a heuristic analysis of the solutions. The major conclusion is that BG should be more sensitive to high correlations in the regressors. On the other hand, it is easier to tune BG if one has a priori information on the rank of the matrix in matrix completion/regression. These intuitions are confirmed in a short simulation study.

Clarity - Justification:
The paper is clear and globally well written.

Significance - Justification:
While there is little theory in this paper, the VB approximations studied are widely used in practice, and I definitely think that the conclusions of the paper are helpful for practitioners. Definitely a useful methodological paper. My only regret is that no generalization-error bound and/or convergence analysis for the algorithms is provided; see the detailed comments below.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Regarding VB in matrix regression problems, two pieces are still missing in the understanding of the algorithm:

1) A proof that the VB approximation is sensible: for example, that the distance between the VB approximation and the true posterior vanishes when the sample size is large, or a generalization error bound. Generalization error bounds for VB are available in Alquier, Ridgway & Chopin, On the Properties of Variational Approximations of Gibbs Posteriors, preprint arXiv:1506.04091, together with a discussion of matrix completion. This might be of some interest for the authors.

2) A proof that the optimization algorithms used in practice actually converge to the best VB approximation (to my knowledge, there is no guarantee in general that the algorithm will not get stuck in a local minimum).

Finally, the statistical problem in Equation (1), which contains regression and matrix completion as special cases, is usually referred to as "trace regression" since Rohde & Tsybakov (2011), Estimation of high-dimensional low-rank matrices, Annals of Statistics, 39, 887-930.

=====