First of all, we would like to thank all three reviewers for their detailed comments and constructive feedback. Our detailed responses to the reviewers' comments are given below.

General Comments:

* Wall-clock time plots for Sec. 4.2 and 4.3: In the SG-MCMC literature, a common practice is to compare the convergence rates of different methods by monitoring the error as a function of iterations, *provided the computational complexities of the methods are similar*. Some important examples of this convention are (Chen et al., 2014), (Chen et al., 2015), (Shang et al., 2015), (Ma et al., 2015), and (Li et al., 2016). This convention reveals the true advantages of a given method and prevents the confusion caused by specific implementations. The computational complexities of SGLD, PSGLD, and HAMCMC are similar: they all have linear time and space complexities. Especially in Sec. 4.3, since we have chosen M=2, the complexities of PSGLD and HAMCMC are very similar in both computation and communication (note that the extra burden is on the order of (M-1)D). Therefore, following the SG-MCMC literature, we have used this convention in Sec. 4.2 and 4.3. However, since this issue has been raised by all three reviewers, we will place error-vs-time plots in Sec. 4.2 and 4.3. For Sec. 4.2, the shape of the plot stays almost the same due to the large performance gap. For Sec. 4.3, the curves of PSGLD and HAMCMC get slightly closer, with HAMCMC still being more advantageous.

-Rev 1-

* Summary of L-BFGS: Thank you for the suggestion. Due to space limitations, we will add a pseudocode description of L-BFGS to the supplementary document and refer to it from the main paper (a minimal sketch of the standard two-loop recursion is included at the end of this response).

* Discussion on M: This is a very good question. The asymptotic properties of HAMCMC are not affected by the choice of M. However, in practice, a larger M results in a larger gap between the current and the previous sample. Handling this gap sometimes requires a larger \lambda or a smaller \epsilon. For very large \lambda the algorithm behaves similarly to SGLD, and a very small \epsilon degrades efficiency. Therefore, a balance has to be struck between these parameters. We have observed that choosing M between 2 and 5 works very well in all scenarios. We will add a discussion on this subject; thank you for pointing it out.

-Rev 2-

* Neural networks: Due to time and space limitations, we have decided not to experiment on neural networks, even though they are of interest to us as well. We leave this topic as future work for a longer journal version of this paper. Thank you for the suggestion.

* Comparison with the Gibbs sampler: As pointed out in (Zhang and Sutton, 2011) (and others), despite its elegant theory, the Gibbs sampler may actually mix poorly on distributions with strong correlations among the variables. We believe that the performance improvement of HAMCMC over the Gibbs sampler stems from the use of higher-order geometric information, somewhat reminiscent of the behavior of quasi-Newton methods versus alternating coordinate-ascent methods in optimization.

-Rev 3-

* Constant stepsize: With a constant step size, we no longer have asymptotic unbiasedness; however, the estimates will still be very close to the true expectations. We can indeed show that the bound on the bias is O(1/(K\epsilon) + \epsilon) and the bound on the MSE is O(\frac{1}{K^2} \sum_k E\|\Delta V_k\|^2 + \frac{1}{K\epsilon} + \epsilon^2), as spelled out below. The proof is similar to that of the main theorem. We will add a discussion on this topic in Sec. 4.3.
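For concreteness, the bounds we would state in the revision take the following form; this is a sketch in the style of (Chen et al., 2015), where \hat{\phi} denotes the sample average of a test function over K iterations, \bar{\phi} the true posterior expectation, \epsilon the constant step size, and \Delta V_k the perturbation induced by the stochastic gradient at iteration k (the notation here is illustrative, not the paper's):

  |E[\hat{\phi}] - \bar{\phi}| = O\!\left( \frac{1}{K\epsilon} + \epsilon \right),

  E\!\left[ (\hat{\phi} - \bar{\phi})^2 \right] = O\!\left( \frac{1}{K^2} \sum_{k=1}^{K} E\|\Delta V_k\|^2 + \frac{1}{K\epsilon} + \epsilon^2 \right).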
* Proof of Proposition 1: We do not agree with the reviewer on this issue. Equation 7 is indeed a Markov chain of *order M*. The only crucial part of this proposition is that H(.) depends on the current sample \theta_{t-1}, and the rest of the older samples (\theta_{t-M:t-2}) turn out to be irrelevant. Therefore, in the context of this proposition, we can denote the L-BFGS matrix as H(\theta_t), a function of the current state \theta_t. Accordingly, Thms 1 and 2 of (Ma et al., 2015) are indeed applicable and the proof is rigorous.

* Comparison with SGFS: Even though the computational complexity of SGFS might be reduced by using rank-1 Cholesky updates, its overall complexity is still very large. Its space complexity is O(D^2), since it requires storing DxD matrices, and it requires operations on these DxD matrices (e.g., inversions) with up to O(D^3) time complexity. The method is similar to SGRLD in this sense, and it quickly becomes impractical for large D. Therefore, in terms of computational complexity, SGFS is not actually comparable to PSGLD or HAMCMC, which both have linear time and memory complexities. For these reasons, we have decided not to consider SGFS as a competitor in our experiments.

* Line 633: Yes, we also recompute the Cholesky decomposition at each iteration. We thank the reviewer for pointing out this issue. We will correct the footnote in the next version of the paper.
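* Sketch of the L-BFGS summary (for Rev 1): Below is a minimal sketch of the standard L-BFGS two-loop recursion (Nocedal, 1980), of the kind we plan to include as pseudocode in the supplementary document. It is only illustrative: the function and variable names (lbfgs_direction, s_list, y_list) are our own and not taken from the paper, and the supplementary version will use the paper's notation and the sample/gradient differences employed by HAMCMC.

    import numpy as np

    def lbfgs_direction(grad, s_list, y_list):
        # Standard L-BFGS two-loop recursion (Nocedal, 1980).
        # grad   : current gradient, shape (D,)
        # s_list : last M position differences,  s_i = theta_{i+1} - theta_i
        # y_list : last M gradient differences,  y_i = grad_{i+1} - grad_i
        # Returns an approximation of H * grad, where H approximates the
        # inverse Hessian, using only O(M*D) memory and time.
        q = grad.copy()
        rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
        alphas = []
        # First loop: traverse the stored pairs from most recent to oldest.
        for s, y, rho in reversed(list(zip(s_list, y_list, rhos))):
            alpha = rho * (s @ q)
            q = q - alpha * y
            alphas.append(alpha)
        # Initial scaling H0 = gamma * I, built from the most recent pair.
        s_last, y_last = s_list[-1], y_list[-1]
        gamma = (s_last @ y_last) / (y_last @ y_last)
        r = gamma * q
        # Second loop: traverse the pairs from oldest to most recent.
        for (s, y, rho), alpha in zip(zip(s_list, y_list, rhos), reversed(alphas)):
            beta = rho * (y @ r)
            r = r + (alpha - beta) * s
        return r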