Paper ID: 306
Title: Stochastic Quasi-Newton Langevin Monte Carlo

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes a stochastic gradient quasi-Newton Langevin Monte Carlo algorithm that reduces the computational complexity of 2nd-order stochastic gradient MCMC algorithms while still preserving a dense approximation to the inverse Hessian. It is the MCMC counterpart of L-BFGS.

Clarity - Justification:
The paper is well organized and it is easy to follow the main idea. The proposed algorithm is presented clearly. However, I did not check the detailed proof of the main theorem in the supplementary material.

Significance - Justification:
The proposed algorithm has the nice property of maintaining a lower computational complexity than most 2nd-order stochastic gradient MCMC algorithms in the literature, while at the same time being more flexible in approximating the inverse Hessian matrix than preconditioned SGLD. Importantly, the Monte Carlo estimate can be shown to converge asymptotically under certain conditions.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Vanilla SGLD suffers from slow mixing when the random variables are highly correlated. Preconditioned MCMC usually uses a poor diagonal approximation to keep the computational complexity low, while 2nd-order methods such as Riemannian manifold MCMC algorithms are more flexible but have complexity that is quadratic in the dimensionality. This paper proposes HAMCMC, which has linear complexity in both time and space while keeping a dense approximation to the inverse Hessian. Given the success of L-BFGS in the optimization world, HAMCMC seems to be a promising algorithm for MCMC. Experiments 4.1 and 4.2 show that HAMCMC is significantly better than PSGLD with a diagonal approximation, but the experiment on distributed MF (4.3) does not show a clear advantage.

The theoretical difficulty of introducing quasi-Newton methods to the MCMC domain is that the approximate inverse Hessian depends on more than one previous iteration, so the Markov property breaks down. The authors manage to show the asymptotic convergence of HAMCMC on a higher-order Markov chain. They also exploit the conditional independence of the parameters and avoid the expensive computation of the 'correction' term \Gamma. This is a nice theoretical contribution for their algorithm, but I did not check the proof details of Theorem 1 myself.

The convergence analysis depends on the assumption of a decaying step size. That leads to large auto-correlation when \epsilon becomes small, and in experiment 4.3 the authors actually use a fixed step size. Could the authors give some discussion of the bias of the stationary distribution under a constant step size?

The proof of Proposition 1 is not rigorous, since H_t in Equation 7 depends on \theta_{t-M}, ..., \theta_{t-1} and is therefore not a (first-order) Markov chain, so Theorems 1 and 2 of Ma et al., 2015 do not apply directly.

My main concern about this paper is in the experiments. Firstly, the main competitor in the experiments is PSGLD, which uses a diagonal approximation. But a cheap and more flexible preconditioned MCMC algorithm can be obtained by approximating the Hessian matrix with a moving average, as used in SGFS (Ahn et al., 2012). The Cholesky decomposition can be maintained with a rank-1 update (a sketch is given below).
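To make this suggestion concrete, the following is a minimal sketch (in Python/NumPy) of a moving-average dense preconditioner whose Cholesky factor is maintained with a rank-1 update. It is illustrative only: the exponential-moving-average form, the decay rate lam, and the function names are my own assumptions, not details taken from SGFS or from the paper under review.

    import numpy as np

    def chol_rank1_update(L, x):
        """Given lower-triangular L with A = L L^T, return the Cholesky factor of A + x x^T."""
        L, x = L.copy(), x.copy()
        n = x.size
        for k in range(n):
            r = np.hypot(L[k, k], x[k])            # new k-th diagonal entry
            c, s = r / L[k, k], x[k] / L[k, k]     # rotation coefficients
            L[k, k] = r
            if k + 1 < n:
                L[k+1:, k] = (L[k+1:, k] + s * x[k+1:]) / c
                x[k+1:] = c * x[k+1:] - s * L[k+1:, k]
        return L

    def update_preconditioner(L_prev, grad, lam=0.01):
        """Moving-average step: Cholesky factor of (1 - lam) * L_prev L_prev^T + lam * grad grad^T."""
        return chol_rank1_update(np.sqrt(1.0 - lam) * L_prev, np.sqrt(lam) * grad)

Each such update costs O(D^2), which is more per iteration than PSGLD's diagonal preconditioner but avoids the O(D^3) cost of refactorizing a dense matrix from scratch.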
This algorithm should give at least roughly the same performance as HAMCMC when the Fisher information does not depend heavily on location, as in experiment 4.1, and its complexity does not carry HAMCMC's overhead that scales with M.

Secondly, in experiments 4.2 and 4.3 the authors compare the various algorithms given the same number of iterations. However, HAMCMC is more expensive than SGLD and PSGLD by a constant factor. A fairer comparison would use running time instead of the number of iterations. For experiment 4.3, the test RMSE of PSGLD should then decay faster than that of HAMCMC as a function of runtime.

Also, regarding line 633, "SGRLD requires much heavier computations as it needs to store a D × D matrix, ...": in the model of 4.1 the expected FIM can be precomputed, so the computation of SGRLD should be very efficient in this case. Do the authors repeat the same Cholesky decomposition every iteration when measuring running time?

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes a quasi-Newton sampler that utilizes history information for a faster sampler.

Clarity - Justification:
The paper is clearly presented.

Significance - Justification:
This paper presents a new idea to incorporate second-order information via a quasi-Newton method. This is the first method I know of that does so effectively (without a diagonal approximation).

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
This is an interesting paper which extends the current stochastic gradient Monte Carlo algorithms to higher order for more accuracy and efficiency. Overall, its theoretical contribution is correct and clear, with experimental evidence of outperforming other methods. The paper is also clearly written.

Summary: The paper uses a higher-order (or, equivalently, higher-dimensional) Markov chain to calculate a better approximation of the Riemannian Langevin dynamics for a better update step of the Monte Carlo algorithm. As opposed to first-order methods, more historical information is used to calculate the proposed state. In particular, the preconditioning matrix (the inverse of the Hessian matrix) in front of the stochastic gradient is estimated from the M most recent past iterates. This approach, while keeping the scalability of the stochastic gradient algorithms, seems to yield better results for both simulated and real-world sampling tasks.

- It would be helpful to summarize the L-BFGS step, Eq. (11), in an algorithm, to make Alg. 1 clearer (a generic sketch of the two-loop recursion is included after this review for reference).
- One potential drawback of the method is that it relies on gradients from past parameters \theta(t-M); how will the memory size M affect the convergence? It would be helpful to discuss this trade-off.
- While comparing the results in the experiments, it may be better to compare the error convergence with respect to run time instead of the number of iterations. Readers may want to better understand and compare the computational burden introduced by the quasi-Newton method.

In general, I think this paper presents novel ideas and can be of interest to the audience at ICML.
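Regarding the first bullet point above, here is a generic sketch (in Python/NumPy) of the standard L-BFGS two-loop recursion for computing an approximate inverse-Hessian-vector product from the M most recent (s, y) pairs. It is meant only to illustrate the kind of algorithmic summary being requested; it is the textbook recursion, not a reproduction of the paper's exact Eq. (11).

    import numpy as np

    def lbfgs_two_loop(grad, s_list, y_list):
        """Return H @ grad, where H is the L-BFGS inverse-Hessian approximation.
        s_list[i] = theta_{i+1} - theta_i and y_list[i] = grad_{i+1} - grad_i,
        ordered from oldest to newest; at most M pairs are kept."""
        rhos = [1.0 / np.dot(y, s) for s, y in zip(s_list, y_list)]
        q = grad.copy()
        alphas = []
        # First loop: newest pair to oldest.
        for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
            alpha = rho * np.dot(s, q)
            alphas.append(alpha)
            q -= alpha * y
        # Initial scaling H_0 = gamma * I, with gamma taken from the most recent pair.
        gamma = np.dot(s_list[-1], y_list[-1]) / np.dot(y_list[-1], y_list[-1])
        r = gamma * q
        # Second loop: oldest pair to newest.
        for s, y, rho, alpha in zip(s_list, y_list, rhos, reversed(alphas)):
            beta = rho * np.dot(y, r)
            r += (alpha - beta) * s
        return r

Each such product costs O(MD) time and memory, which is the linear scaling in the dimension referred to in the reviews.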
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes an L-BFGS version of the stochastic gradient MCMC method, called Hessian Approximated (HA) MCMC. Similar to L-BFGS, HAMCMC computes the local inverse Hessian in an efficient way so that the sampler explores the sampling space more efficiently. Although the Hessian captures the full covariance structure of the parameters, the computation and memory complexity remain linear w.r.t. the dimension of the problem, and the sampler remains valid in terms of convergence. In the experiments, the authors demonstrate the advantages of the proposed method on three problems.

Clarity - Justification:
The paper is written very clearly and explained well. The paper was easy to follow; I really enjoyed reading it.

Significance - Justification:
With quite clever tricks, the authors achieve computational efficiency while capturing the full Riemannian manifold matrix. Previously, only one of the two was achieved: SGRLD and SGFS use the full Riemannian manifold matrix, but the computation is expensive; on the contrary, methods such as PSGLD are computationally efficient but use only a diagonal approximation of the Riemannian manifold matrix. In this sense, the method can be considered quite important progress in SG-MCMC methods. It also seems that the trick used to make the method valid in terms of convergence could be useful in developing other methods.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
- It would have been interesting to have some experiments on neural networks, as other related works have done.
- In Fig. 5, I would also like to see a wall-clock time plot.
- In Fig. 3, it is hard to understand how HAMCMC can be better than Gibbs sampling. Considering that both have converged, shouldn't the exact sampler (Gibbs) always be the upper bound? I would expect a clearer explanation of this.

=====