The reviewers should be commended and thanked for providing very reasonable and thoughtful reviews. The paragraphs immediately below respond to what may be considered the primary issue raised by the reviewers; the list at the end provides brief responses to the remaining, more minor points.

The primary issue relates to the datasets used in the numerical experiments. In particular, the datasets are relatively small even though the algorithm is proposed as a technique for solving large-scale problems. This point is well taken. However, it is readily addressed by appealing to the per-iteration cost analysis that has accompanied proposals of other stochastic quasi-Newton methods; e.g., the paper refers to the analysis in (Byrd et al., 2015). In light of this analysis, it is valid to demonstrate the benefits of the proposed stochastic quasi-Newton method on moderately sized problems while claiming that the benefits would also be realized (perhaps to an even greater degree) on large-scale problems. (A sketch of this analysis was not included in the paper due to space restrictions and because it can be found in other sources.)

The analysis proceeds as follows. Consider, for example, binary logistic regression with a parameter vector of dimension d. One can show---see (Byrd et al., 2015)---that the cost of one mini-batch gradient estimate is approximately 2*b*d operations, where b is the mini-batch size. At the same time, it is well known that the cost of a matrix-vector product in a limited memory quasi-Newton method is approximately 4*m*d operations, where m is the limited memory history length. Overall, the cost of one stochastic quasi-Newton (SQN) iteration is approximately 2*b*d + 4*m*d, while the cost of one stochastic gradient (SG) iteration with the same mini-batch size is approximately 2*b*d. The ratio of these costs (SQN to SG) is therefore 1 + 2*m/b. This is certainly greater than one, but only by a small amount when the mini-batch size is much larger than the history length. Developers of stochastic quasi-Newton algorithms have come to realize that one should only expect significant gains in this regime, i.e., when m/b is small. In the limited memory experiments in the paper, this value is less than 0.08, and there are many larger-scale problems for which it is appropriate for this value to be considerably smaller. A small illustration of this arithmetic follows.
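To make the arithmetic above concrete, here is a minimal sketch (not part of the paper or its code) that computes the SQN-to-SG cost ratio; the function name and the values of b and m are purely illustrative, not those used in the paper's experiments.

    # A minimal sketch of the per-iteration cost arithmetic described above,
    # assuming the approximate operation counts from (Byrd et al., 2015):
    # a mini-batch gradient for binary logistic regression costs roughly
    # 2*b*d operations and a limited memory matrix-vector product costs
    # roughly 4*m*d operations, so the dimension d cancels in the ratio.

    def sqn_to_sg_cost_ratio(b, m):
        """Per-iteration cost ratio: (2*b*d + 4*m*d) / (2*b*d) = 1 + 2*m/b."""
        return 1.0 + 2.0 * m / b

    # Illustrative (hypothetical) values, not those from the paper's experiments:
    # with history length m = 5 and mini-batch size b = 128, m/b is about 0.04,
    # so an SQN iteration costs only about 8% more than an SG iteration.
    print(sqn_to_sg_cost_ratio(b=128, m=5))  # 1.078125

Since d cancels, the relative overhead of SQN over SG depends only on the ratio m/b, which is why the regime of large mini-batches (relative to the history length) is the relevant one.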
Overall, while the reviewers' points on this matter are well taken, the experiments in the paper do provide evidence that the proposed method should be effective for large-scale problems.

The following list paraphrases the other questions/issues raised by the referees and provides responses to them.

- “If they really believe that the method works they should also release their code.”
+ The code will be released; it has not been released thus far only in order to respect the double-blind review process.

- “Does the limited memory version of the algorithm actually have the self-correcting property?”
+ Yes, it does. This is a consequence of the limited memory approximation being the result of a finite number of updates.

- “It would be more convincing if the authors compared to several competing approaches.”
+ This point is well taken, but such comparisons are often quite messy. After all, one can argue about which performance measure is best (training error? testing error?), how to fairly implement the algorithms of others and/or compare algorithms written in different languages, and, in the present case, how one should account for the fact that certain other stochastic quasi-Newton methods use exact second-order information (while the one in this paper does not). Rather than confuse matters with these issues, a conscious decision was made to compare only against the default benchmark (i.e., SG) and one of the simpler stochastic quasi-Newton methods to implement (oLBFGS). After the double-blind review is over and the code is released, others can certainly make further comparisons.

- “The experiments should not search over many possible parameter settings before deciding on their final values.”
+ The sentiment that users do not want to search over various parameter settings is appreciated. However, such tuning is standard practice when comparing methods; otherwise, one may object that the parameters chosen for one algorithm are not as good as those chosen for another.

- “Experimental results should be reported based on the value of the objective, not the test error.”
+ Both training and testing error are of interest, which is why both have been reported.