Paper ID: 1202
Title: Robust Monte Carlo Sampling using Riemannian Nosé-Poincaré Hamiltonian Dynamics

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper generalizes stochastic Nosé-Hoover dynamics to the Riemannian case to obtain a more effective sampler.

Clarity - Justification:
The outline of the paper is clear. However, parts of the proof are unclear; see the detailed comments.

Significance - Justification:
This paper proposes a new stochastic MCMC sampler.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
This paper generalizes stochastic Nosé-Hoover dynamics to the Riemannian case to obtain a more effective sampler.

Comments:
- It would be helpful to discuss the relation between Nosé-Poincaré dynamics and the Nosé-Hoover dynamics in the SG-NHT paper.
- It would be helpful to write out Eq. (8) in matrix form in terms of D and X, to make the correction term clear.
- The proof of the marginal distribution (after discarding s and q) in Theorem 1 (Appendix A.1) is unclear. The goal should be to prove \int_{s,q} \exp(-H(\theta, p, s, q)) \propto \exp(-H_{gc}(\theta, p)), in order for the stochastic version of the sampler to be correct.
- The authors use a \delta function to represent the probability distribution, which does not seem to equal \exp(-H(\theta, p, s, q)).
- The \delta function seems to represent the distribution of the state under the deterministic dynamics, and that is not directly connected to the goal (that the stationary distribution of the stochastic version of the sampler marginalizes to the correct distribution).
- Please provide a demonstration on a special case, for example taking \theta to be one-dimensional and G(\theta) to be the identity, to show that the integration over s indeed gives the correct marginal (a sketch of the kind of demonstration I have in mind appears at the end of this review).

In summary, this paper constructs a new stochastic MCMC sampler from Nosé-Poincaré dynamics and enhances it with Riemannian geometry. It requires a more direct proof to show that the enhanced joint distribution produces the correct marginal, which is not obvious to the reader.

Comments After Author's Feedback
--------------------------------
The authors proved that the distribution at a certain energy level gives the correct marginal: "\int_{s} \delta[H - H_0] \propto \exp(-H_{gc} / kT) / Z(\theta)". This distribution, however, is not the joint distribution p(\theta, s, q, p). Instead, it seems to represent the conditional distribution under the constraint that the energy equals H_0, i.e., p(\theta, q, p | H = H_0). This result can nevertheless imply that the final joint distribution integrates to the correct one, since the conditional distribution appears to be invariant to H_0. In summary, the marginal seems to be correct, but the proof needs to be revised to clarify the usage of the conditional distribution.
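For reference, here is a minimal sketch of the special-case calculation requested above, assuming the standard Nosé-Poincaré form of the Hamiltonian (this is my own reconstruction, not necessarily the authors' exact Eq. (4)): one-dimensional \theta, G(\theta) = I, unit mass, H_{gc}(\theta, p) = U(\theta) + p^2/2.

```latex
% Sketch only (not the authors' proof): standard Nose-Poincare marginalization.
% Assumes H_Nose = \tilde p^2/(2s^2) + U(\theta) + q^2/(2Q) + g kT \ln s,
% restricted to the level set H_Nose = H_0, with g = n + 1 = 2 for n = 1.
\begin{align*}
\int d\tilde p \, ds \; \delta\big(H_{\mathrm{Nose}} - H_0\big)
  &= \int dp \, ds \; s \, \delta\big(C + g k T \ln s\big)
     && (p = \tilde p / s,\ d\tilde p = s \, dp) \\
  &= \int dp \; \frac{s_0^{2}}{g k T},
     \qquad s_0 = e^{-C/(g k T)}
     && \big(\text{unique root; } |\partial_s(\cdot)|_{s_0} = g k T / s_0\big)
\end{align*}
% where C = H_{gc}(\theta, p) + q^2/(2Q) - H_0.  With g = 2,
%   s_0^2 = e^{-C/kT} \propto e^{-H_{gc}(\theta,p)/kT} \, e^{-q^2/(2QkT)},
% so integrating out q leaves the desired marginal \propto e^{-H_{gc}/kT};
% H_0 enters only through the constant factor e^{H_0/kT}, consistent with the
% invariance to H_0 noted above.
```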
=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors introduce a new Hamiltonian dynamics-based MCMC procedure that takes advantage of the modified Nosé-Poincaré Hamiltonian. They show that after introducing a correction term, their procedure is compatible with stochastic gradients obtained from subsampling. Finally, they provide some evidence that their procedure improves over a previous, related Nosé-Hoover Hamiltonian-based MCMC algorithm.

Clarity - Justification:
The authors have mostly done a good job of presenting their ideas clearly, at least to an expert audience. As someone who does not ordinarily work on Hamiltonian Monte Carlo in particular, I did find some of the technical leaps difficult to follow, but perhaps this is inevitable given how much technical apparatus is required to explain these types of MCMC approaches.

Significance - Justification:
The paper appears to be of reasonable significance. Some questions do bother me, however. The authors certainly appear to have introduced a novel technique into the Hamiltonian Monte Carlo family, and the technical work that goes into showing their algorithm is correct is substantial enough that it seems like it would take some effort to be independently rediscovered. The experiments are also pretty good, although I would have liked to see a direct comparison to vanilla HMC and/or Riemannian HMC. Further, the experimental results would be substantially more convincing if they included results showing how the distribution of their samples compares to the truth (and to that of other MCMC procedures). All the results provided have to do with point-estimation-like problems: reconstruction error in Section 4.2 and test perplexity in Section 4.3. Although they show density plots in Figure 1, this is only for a Gaussian example.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Overall, the paper is not bad. It appears to have some technical novelty and some potential practical advantages, and these are sufficient to warrant acceptance. Nonetheless, the paper could still be improved substantially if the authors did one or both of the following:

1. Provide experimental evidence that their algorithm converges to the target distribution better/faster. This could be achieved by focusing on comparisons that require getting the whole distribution right, rather than just converging to a region close to the mode (as, e.g., RMSE and perplexity seem to favor). One way to do this is to look at small examples where the posterior can be normalized and sampled from to high accuracy using quadrature (see the sketch at the end of this review). Another alternative is to run a simple MCMC procedure for a very long time and treat that as the gold standard; this would also help demonstrate that whatever transient bias is introduced by using stochastic gradients isn't biasing the results of *all* the algorithms they consider (apart from Gibbs, which to their credit they do include).

2. Clarify the exposition for MCMC experts who don't work on HMC. The paper would be stronger if people in the MCMC community who are not necessarily steeped in the HMC literature could follow it more easily. I'm not sure how feasible this is within the given space constraints, so I'm giving this more as a comment than a suggestion for immediate incorporation into the paper.
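To make the first suggestion concrete, here is a minimal sketch of the kind of check I have in mind, on a 1-D toy posterior. The names log_post and run_sampler are hypothetical placeholders (not the authors' API), and any small target amenable to quadrature would do:

```python
# Minimal sketch: normalize a 1-D toy target by quadrature and compare each
# sampler's empirical CDF against the exact CDF. 'run_sampler' is a
# hypothetical stand-in for SG-NHT, the proposed sampler, etc.
import numpy as np
from scipy.integrate import quad

def log_post(theta):
    # toy multimodal target; any small, quadrature-friendly example works
    return -0.5 * theta**2 + np.log(1.0 + 0.9 * np.cos(3.0 * theta))

Z, _ = quad(lambda t: np.exp(log_post(t)), -10.0, 10.0)  # gold-standard normalizer

def exact_cdf(x):
    val, _ = quad(lambda t: np.exp(log_post(t)) / Z, -10.0, x)
    return val

def ks_distance(samples, grid=np.linspace(-5.0, 5.0, 200)):
    # sup-norm gap between the empirical CDF and the quadrature CDF
    emp = np.array([(samples <= x).mean() for x in grid])
    exact = np.array([exact_cdf(x) for x in grid])
    return np.abs(emp - exact).max()

# samples = run_sampler(log_post, n_iter=50_000)  # hypothetical sampler call
# print(ks_distance(samples))
```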
=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors propose an extension to Hamiltonian Monte Carlo, incorporating Riemannian geometry and Nosé-Poincaré thermostats. A variant with an additional stochastic dissipator to allow for the use of stochastic gradients is also presented, and the methods are tested on some sensible examples against competitors.

I think the paper is a reasonable contribution, combining several known methods to create a new one. The equations are a little cumbersome, but this seems unavoidable given the content. The paper is also easy enough to read.

My first comment is that I don't like the name of the sampler; it's too long, bordering on ridiculous. I would definitely change it, though I leave the decision with the authors.

I think the experiments should compare samplers based on wall-clock time rather than number of iterations, particularly as it looks like solving this large Hamiltonian system could be quite expensive (a sketch of such a comparison appears at the end of this review).

One other comment I have is that the work of Betancourt on stochastic gradients in HMC [1] is not mentioned, and I think it should be, given that it is relevant and was published in ICML last year.

[1] Betancourt, Michael. "The fundamental incompatibility of scalable Hamiltonian Monte Carlo and naive data subsampling." Proceedings of the 32nd International Conference on Machine Learning (ICML-15), 2015.

Clarity - Justification:
I think the writing style is good and the paper on the whole was relatively easy to read. I have commented on Theorem 1 in the 'detailed comments' section.

Significance - Justification:
I would say this is an incremental but relevant contribution. Several known methods are combined, so there isn't much that is completely novel, but they are combined in a sensible way and the authors have clearly done some work in deriving the dynamics correctly.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I think the wording of Theorem 1 is unclear. Hamiltonians don't generate samples; they are just functions from which we can define dynamics that preserve a certain measure. What you prove is that the (\theta, p) marginal distribution of \exp(-H(\theta, p, s, q)) is as desired, meaning that the marginal (\theta, p) dynamics from the system governed by the Hamiltonian in Eq. (4) are measure-preserving for \exp(-H(\theta, p)/kT). I also think the _gc subscript should be defined somewhere.

Typos:
Page 6, column 2, line 581: aceptable -> acceptable
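As an entirely schematic version of the wall-clock comparison suggested above, one could report effective sample size per second rather than per iteration. Here run_sampler is a hypothetical stand-in for each method being compared, and the ESS estimate is deliberately crude:

```python
# Schematic wall-clock comparison: effective sample size per second.
# 'run_sampler' is a hypothetical placeholder; the ESS below uses a crude
# initial-sequence estimator (sum autocorrelations up to the first negative lag).
import time
import numpy as np

def ess(x):
    # ESS = n / (1 + 2 * sum of autocorrelations before the first negative lag)
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    acf = np.correlate(x, x, mode='full')[len(x) - 1:] / (x.var() * len(x))
    first_neg = np.argmax(acf[1:] < 0) + 1  # index of first negative lag
    tau = 1.0 + 2.0 * acf[1:first_neg].sum()
    return len(x) / max(tau, 1.0)

# t0 = time.time()
# samples = run_sampler(n_iter=10_000)  # hypothetical
# print("ESS per second:", ess(samples) / (time.time() - t0))
```
=====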