We thank reviewers R1, R5, and R6 for their valuable comments. We address their questions below.

(1) R5 and R6 wonder if and how the optimal constant learning rate can be estimated, since the gradient noise covariance BB^T is unknown in practice. BB^T cannot be measured directly, but we can efficiently compute an unbiased estimate of it from a minibatch of gradients. The noise in this estimate can be smoothed out by a simple online moving average over the unbiased minibatch estimates; this is also the approach taken in stochastic gradient Fisher scoring. A simpler and cheaper alternative is to maintain online estimates of the first and second moments of the gradients, as is done in algorithms such as AdaGrad and RMSProp; this is the approach we took in our experiments (a minimal code sketch of this estimator is given below, after point (3)). We will clarify these points in the final version, and we thank the reviewers for pointing out the need for clarification.

(2) R1 and R6 ask what happens when the posterior is non-Gaussian or even multi-modal. Strictly speaking, our theory does not describe the case where the posterior is multi-modal. However, as the step size shrinks, transitions between posterior modes become exponentially suppressed (Kramers 1940). In this regime, our theory describes SGD as approximately converging in distribution around a single local optimum.

(3) R1 and R6 would like to see a better justification of assumptions 1-4 and a discussion of their strengths and limitations, in particular of assumptions 2 and 4. Our assumptions are stronger than those in the conventional stochastic optimization literature, but they let us focus on the relevant parts of our analysis and yield concise, interpretable results. We compare empirical and predicted covariances in figures 1 and 2 and find very good agreement, which gives us some confidence that the assumptions are reasonable in this regime. Assumption 2 (constant noise) is perhaps the strongest assumption and is clearly only approximately true. It relies on the idea that in the limit of large data (N), the posterior is tightly centered around the optimum and the SGD iterates therefore remain close to it; assuming smoothness, the noise covariance is approximately constant in this region. Based on the same large-N argument, we approximate the objective by a quadratic and thus the posterior by a Gaussian (Bayesian central limit theorem), justifying assumption 4; a sketch of this approximation is given directly below.
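To spell out the large-N argument behind assumptions 2 and 4, here is a brief sketch; the notation below is ours and is meant only as an illustration. Let L(\theta) denote the negative log posterior divided by N, with minimizer \theta^* and Hessian A = \nabla^2 L(\theta^*). A second-order Taylor expansion gives

    L(\theta) \approx L(\theta^*) + (1/2) (\theta - \theta^*)^T A (\theta - \theta^*),

so the posterior p(\theta | data) \propto exp(-N L(\theta)) is approximately Gaussian,

    p(\theta | data) \approx N(\theta^*, (N A)^{-1}),

with covariance shrinking as 1/N. For large N the iterates therefore remain in a small neighborhood of \theta^*, where smoothness lets us treat the noise covariance as constant (assumption 2) and the objective as quadratic (assumption 4).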
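As a concrete illustration of the moment-based estimate mentioned in point (1), the following minimal Python/NumPy sketch keeps exponential moving averages of the first and second moments of the minibatch gradients and uses them to form a diagonal estimate of the gradient noise variance. The function and variable names are hypothetical, and this is meant as an illustration of the idea rather than the exact procedure used in our experiments.

    import numpy as np

    def estimate_noise_moments(minibatch_gradients, beta=0.9):
        """Exponential moving averages of the gradient's first and second moments.

        `minibatch_gradients` is an iterable of 1-D arrays, one averaged
        gradient per minibatch (hypothetical interface, not the paper's code).
        """
        m1 = None  # running estimate of E[g]
        m2 = None  # running estimate of E[g^2]
        for g in minibatch_gradients:
            if m1 is None:
                m1, m2 = g.copy(), g ** 2
            else:
                m1 = beta * m1 + (1.0 - beta) * g
                m2 = beta * m2 + (1.0 - beta) * g ** 2
        noise_var = m2 - m1 ** 2  # diagonal estimate of the gradient noise covariance
        return m1, noise_var

    # Usage with synthetic noisy gradients of a 3-parameter model.
    rng = np.random.default_rng(0)
    grads = [rng.normal(loc=[1.0, -2.0, 0.5], scale=0.3) for _ in range(500)]
    mean_grad, noise_var = estimate_noise_moments(grads)

The resulting estimates can then be plugged into the paper's expression for the optimal constant learning rate; a full-covariance version would instead average the outer products of centered per-example gradients within each minibatch.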
(4) R1 asks whether it is guaranteed that SGD converges in distribution. This is a subtle question: one can certainly construct problems where SGD's iterates are bounded but do not converge to a stationary distribution. We will make this claim more precise in the final version.

(5) The following comments address reviewer R5.
- "More experiments are desirable." We agree that more experiments, in particular for non-Gaussian posteriors, are desirable; we plan to carry them out in a future journal extension of the paper.
- "Empirical Bayes / type II maximum likelihood is standard." We agree and will include more references. Our point was to show that constant SGD is a simple and scalable variational EM algorithm for type II maximum likelihood, simpler than conventional variational approaches.
- "The second experiment (covertype) shows no overfitting." We agree and will repeat this experiment with a smaller training set and a larger test set.
- "Models with multiple hyperparameters might be more interesting." We completely agree and will include such an experiment in a future journal version of the paper, along with additional unpublished results on iterate averaging.
Thank you also for your comments on Eq. 28 and Eq. 26; we will do a better job of explaining these points, in particular the discussion of h_max. The connection to Welling and Teh will be made more explicit in the related work section.

(6) R6 asks what learning rate epsilon we use for preconditioned SGD. In preconditioned SGD, epsilon is absorbed into the preconditioning matrix. We will clean up the mathematical notation in this part; thanks for pointing this out.

(7) Regarding R1's remark on Eq. 8: we wrote Eq. 8 in this form because of the continuous-time approximation step that follows it. To derive the continuous-time equation, \Delta W must be replaced by the Wiener increment dW, which satisfies E[dW(t) dW(t')] = \delta(t - t') dt; hence dW(t) ~ N(0, dt), and thus \Delta W ~ N(0, \epsilon). A short sketch of this correspondence is given below.
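To make the correspondence in point (7) explicit, here is a short sketch in our own notation (the precise form and prefactors of Eq. 8 in the paper may differ). Consider a stochastic differential equation

    d\theta(t) = -g(\theta) dt + B dW(t),

where W(t) is a Wiener process. Discretizing with time step \epsilon (Euler-Maruyama) gives

    \Delta\theta = -\epsilon g(\theta) + B \Delta W,    \Delta W = W(t + \epsilon) - W(t) ~ N(0, \epsilon I),

because E[dW(t) dW(t')] = \delta(t - t') dt implies that the increment over an interval of length \epsilon has covariance \epsilon I. Reading this correspondence in reverse is exactly the step that motivates writing \Delta W ~ N(0, \epsilon) in Eq. 8.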