Paper ID: 178
Title: A Variational Analysis of Stochastic Gradient Algorithms

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper provides an SDE view of stochastic gradient sampling for Bayesian inference. Under several assumptions, the stationary distribution of the corresponding SDE can be obtained in analytic form. With this understanding, the hyperparameters, e.g., the step size and minibatch size, can be optimized by minimizing the KL divergence between the posterior and the stationary distribution. The existing stochastic gradient Fisher scoring algorithm is also explained from this point of view.

Clarity - Justification:
The paper is well organized and easy to follow, apart from several typos and a few claims that are not justified.

Significance - Justification:
This paper connects stochastic gradient sampling with stochastic differential equations and Ornstein-Uhlenbeck processes. The novelty also comes from the idea of treating the procedure as a posterior approximator and optimizing the hyperparameters of both the algorithm and the model by minimizing the KL divergence, as in variational inference.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
This paper is novel in that it treats stochastic gradient sampling as an approximator for Bayesian inference. By exploiting the connection to SDEs and Ornstein-Uhlenbeck processes, together with extra assumptions regularizing the behavior of the SDE, the stationary distribution can be obtained; then, by minimizing the variational inference criterion, the hyperparameters of SGD and of the model can be chosen. Moreover, the framework can also explain the existing stochastic gradient Fisher scoring algorithm. I think this paper is interesting and provides a new variational view of stochastic gradient sampling algorithms. However, several points need to be clarified.

1) The assumptions made in the paper to obtain the analytic form of the stationary distribution are not clearly justified, especially Assumptions 2 and 4. In fact, Assumption 2 requires the gradient noise to exhibit a global, state-independent behavior, and the reason for such an assumption is not clearly explained. Moreover, Assumption 4 essentially amounts to approximating the posterior with a Gaussian, as can be seen from the stationary distribution. Based on these assumptions, the stationary distribution is obtained in analytic form, and the hyperparameters are then optimized with respect to it. These assumptions are quite restrictive, and the paper should discuss what happens when they are clearly violated in real-world problems. For example, for multi-modal models Assumption 4 will not hold; are the obtained hyperparameters still valid?

2) The authors claim that stochastic gradient descent with a constant step size converges to a stationary distribution. Is there rigorous mathematical evidence for this claim? Are there any constraints on the constant? After all, the continuous-time SDE is different from its discretized simulation.

3) In Equation (8), $\Delta W$ should perhaps follow $\mathcal{N}(0, I)$ without the multiplicative $\epsilon$.

In sum, this paper is interesting and may have an impact in the Bayesian community. However, I expect the authors to provide satisfying answers to these questions.
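For concreteness, the continuous-time picture I have in mind for points 1) and 2) is the standard multivariate Ornstein-Uhlenbeck fact below (my own notation, assuming a locally quadratic loss with positive-definite Hessian $A$ and a constant noise factor $B$; this may not match the paper's exact statement). The process
$$ d\theta(t) = -A\,\theta(t)\,dt + B\,dW(t) $$
has a Gaussian stationary distribution $\theta_\infty \sim \mathcal{N}(0, \Sigma)$, where $\Sigma$ solves the Lyapunov equation
$$ A\Sigma + \Sigma A^\top = B B^\top. $$
Spelling out which of the assumptions are needed for constant-step-size SGD to be well approximated by such a process, and why the discretization error can be neglected, would address both points.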
=====

Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper is well written and presents new insights about the link between SGD and Bayesian posterior estimation. While the theoretical part is interesting, some extra experiments would be needed.

Clarity - Justification:
The paper is well positioned with respect to the literature and clearly explains the link between SGD and OU processes.

Significance - Justification:
The theoretical part is interesting and can be really helpful for understanding how to choose the step size in a Bayesian setting, but the numerical experiments for type-II maximum likelihood (empirical Bayes) are insufficient.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
About the clarity of the paper: The insights gained by this view are interesting, as they provide an intuitive formula for interpreting the step size in a "posterior-focused SGD algorithm". The link with Welling and Teh (2011) could have been made clearer, as there are probably strong similarities, although I understand that the interpretation as an SDE is the key of this paper. Also, the authors do not use the usual terms "type-II maximum likelihood" or "empirical Bayes", which are fairly standard techniques for choosing hyperparameters and correspond exactly to Section 3.3. Is there a difference? If not, why is no connection made or reference given? (I spell out the connection I have in mind at the end of these comments.)

Minor points that could be clarified:
- Page 2, line 141: SGDFS is not explicitly defined before this point. I guess it stands for SGD Fisher scoring, but it should be spelled out.
- In Eq. (28), $\lambda$ appears on the left-hand side but not on the right-hand side. I guess it is not integrated out and should appear as a conditioning variable in the loss, as done in Section 3.3. Please clarify.
- In Eq. (26), $h_{\max}$ does not depend on $k$. I had a hard time understanding this part.

About the technical contribution: The results obtained are intuitive and relatively simple to prove, which bodes well for these results being extended. One obvious question is how to choose the step size, the preconditioning matrix, and the batch size automatically from the proposed theorems. Unless I'm mistaken, there is no obvious way to find these quantities, so the theorems are mainly useful for understanding the algorithm and do not directly lead to a new algorithm.

One part that could definitely be improved is the automatic tuning of the hyperparameters:
- We do not see overfitting in the second numerical example. Why not choose a smaller training set and a larger test set, so that the minimum of the curve does not sit at a very small regularization? In fact, showing the selected hyperparameter as a function of the training sample size would be really interesting, as it would illustrate the bias induced by the automatic tuning.
- A more interesting setting for hyperparameter tuning: there is a real benefit in automatic tuning when there are multiple hyperparameters, because cross-validation quickly becomes intractable (one would need to cross-validate over a grid of hyperparameters, whose size grows exponentially with their number). For example, cross-validation is nearly impossible when the prior involves multiple parameters $\lambda_1, \ldots, \lambda_K$ corresponding to the regularization of K different groups of features.
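To spell out the empirical Bayes connection I am asking about (standard notation, not taken from the paper): type-II maximum likelihood chooses the hyperparameters as
$$ \lambda^\star = \arg\max_\lambda \log p(y \mid \lambda) = \arg\max_\lambda \log \int p(y \mid \theta)\, p(\theta \mid \lambda)\, d\theta, $$
while a variational treatment instead maximizes the lower bound
$$ \log p(y \mid \lambda) \;\ge\; \mathbb{E}_{q(\theta)}\big[\log p(y, \theta \mid \lambda)\big] + \mathrm{H}[q(\theta)] $$
jointly over $q$ and $\lambda$. If Section 3.3 is doing exactly this, with $q$ taken to be the stationary distribution of constant-step-size SGD, then the standard terminology and references apply; if there is a difference, it should be stated.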
=====

Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper gives a new analysis of the posterior approximations obtained by methods based on stochastic gradient descent. The main idea is to show that the updates can be seen as draws from an Ornstein-Uhlenbeck process. Several variants are then obtained by minimizing a KL divergence to set the parameters of the process. Connections to existing methods are made.

Clarity - Justification:
Clear explanations of the main ideas, as well as of existing methods.

Significance - Justification:
I like the simplicity of the main idea, and how easy it is to understand and apply. I believe such connections might be important for making SGLD-type algorithms more widely used.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I do have some points that remain unclear to me, e.g., how do you estimate B (a rough sketch of what I would expect is given below)? For 'preconditioned constant SGD', which epsilon do you use? For Figure 1, it would be more convincing to show results for a case where the posterior is actually non-Gaussian. It would also be useful to add more examples justifying the assumptions, and a discussion of how weak or strong the assumptions are.
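To make the question about B concrete, the estimator I would naively expect is the sample covariance of minibatch gradients around the full-data gradient, as in the sketch below (my own code; the function names, the resampling scheme, and the assumption that B enters only through the outer product B B^T as the gradient-noise covariance are mine, not the authors').

import numpy as np

def estimate_noise_covariance(grad_fn, data, theta, batch_size, n_samples=100, seed=0):
    """Sample covariance of minibatch gradients around the full-data gradient.

    grad_fn(theta, batch) should return the average gradient over the rows of
    the array `batch`. This is only my guess at one natural estimator of the
    noise covariance C = B B^T; the paper should state how B is actually obtained.
    """
    rng = np.random.default_rng(seed)
    full_grad = grad_fn(theta, data)                       # deterministic part of the update
    noise = np.empty((n_samples, full_grad.size))
    for i in range(n_samples):
        idx = rng.choice(len(data), size=batch_size, replace=False)
        noise[i] = grad_fn(theta, data[idx]) - full_grad   # one gradient-noise realization
    return noise.T @ noise / n_samples                     # empirical covariance; B could be taken as its Cholesky factor

Even a one-line statement of whether B is estimated like this, updated online, or fixed in advance (and at which iterate theta the estimate is refreshed) would make the experiments easier to reproduce.
=====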