Paper ID: 865
Title: Stochastic Block BFGS: Squeezing More Curvature out of Data

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes a limited-memory stochastic block BFGS method, which maintains and updates an estimate of the inverse Hessian matrix using random sketching. It is also combined with the SVRG method, which supplies variance-reduced gradients to speed up convergence. In fact, the linear convergence result in the paper comes mainly from the SVRG scheme, while the BFGS updates with randomized sketching further exploit the curvature of the problem to achieve faster empirical convergence.

Clarity - Justification:
The presentation is mostly clear, and the description of the experiments is detailed enough for reproducibility. One place that needs further explanation is the main convergence rate result in Theorem 1. It would be better to explain, through more concrete estimates of the convergence rate, how it compares with plain SVRG and gradient descent. This would give readers a more concrete grasp of the convergence rate.

Significance - Justification:
The proposed stochastic block BFGS method has several novel contributions, including a new metric learning framework, the limited-memory updates, and the factorized-form updates. These are based on sound principles of quasi-Newton methods, adapted to combine with randomized subsampling techniques. While the convergence rate results are not better than those of state-of-the-art algorithms, the numerical experiments do illustrate substantial improvement over SVRG.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
See the comments above on clarity and significance.

=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors propose a stochastic quasi-Newton method that combines stochastic block BFGS updates with the SVRG technique. Their main contributions are to (1) approximate the inverse Hessian by applying the subsampled Hessian to sketched random blocks instead of single vectors, and (2) combine this with the inner-outer loop technique of SVRG. The authors introduce three sketching methods and provide both limited-memory and factored versions of their algorithm. They also provide a linear convergence analysis as well as accompanying experiments.

Clarity - Justification:
The paper is relatively easy to follow and proceeds logically. The authors begin by introducing the problem they intend to study, recapitulate some of the more common quasi-Newton methods, and in the process motivate their algorithm. They clearly list their contributions. Some of the theoretical details remain a bit unclear:
1) The authors do not explain, or provide a reference explaining, the benefit of subsampling the Hessian versus using the difference of stochastic gradients.
2) More technically, what is the expectation E[x_t] taken over in line 578?
3) The authors do not comment on or compare the efficacy of the three different sketching methods.
4) There is no discussion of the constants in the convergence rate, nor of how the algorithm compares to other algorithms.

Significance - Justification:
The paper motivates and introduces a new stochastic quasi-Newton method but doesn't convincingly differentiate itself from other known results from a theoretical standpoint.
There needs to be more discussion of the convergence rate and how it compares to that of other algorithms, as well as perhaps of how a particular choice of sketching method impacts the convergence.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper is relatively well written and introduces an interesting method. If the details mentioned above are clarified, and the authors are able to compare their theoretical guarantees with those of other state-of-the-art methods more convincingly, then I am willing to upgrade my overall rating.

=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes to use a stochastic block BFGS update within stochastic gradient descent with variance reduction (SVRG). SVRG consists of one full gradient computation followed by a loop of stochastic gradient updates. In this paper, the stochastic block BFGS update is used to accelerate the stochastic gradient updates, and different sketches are studied in the stochastic block BFGS update.

Clarity - Justification:
1. The paper reviews SGD with variance reduction and, in particular, stochastic BFGS.
2. Stochastic BFGS with different sketches is presented.
3. Intermediate proofs are moved to the end, which highlights the main results.

Significance - Justification:
1. The stochastic block BFGS update is not novel. Its use within variance-reduced SGD is incremental.
2. The theoretical results are also not novel enough. In Theorem 1, it appears the rate is not faster than existing results.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
1. The use of stochastic block BFGS in variance-reduced SGD is incremental. The theoretical results mostly follow Moritz et al. (2015), which does not seem novel enough.
2. In Theorem 1, how do \rho and m compare to the corresponding quantities in SVRG and in Moritz's stochastic L-BFGS? It looks like the rate is not faster than existing results.
3. Section 3.1 gives the complexity of the block BFGS update, which is good. What is the overall complexity, and in particular how does it compare to that of existing algorithms?

=====
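For readers unfamiliar with the structure the three reviews refer to, the following is a minimal, illustrative sketch of an SVRG inner-outer loop whose variance-reduced gradient is preconditioned by an inverse-Hessian estimate H, refreshed with a block BFGS-type update built from the action of a subsampled Hessian on a random Gaussian sketch. This is not the authors' implementation; the ridge-regression objective, the Gaussian sketch, and all parameter values below are assumptions made only for the example.

# Minimal illustrative sketch (assumptions, not the paper's implementation):
# SVRG inner-outer loop with a block BFGS-type inverse-Hessian estimate H.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 500, 20, 0.1                      # problem size and ridge parameter (assumed)
X = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def full_grad(w):                             # gradient of the ridge-regression loss
    return X.T @ (X @ w - b) / n + lam * w

def stoch_grad(w, i):                         # gradient of a single sampled term
    return X[i] * (X[i] @ w - b[i]) + lam * w

def hess_action(D, idx):                      # Y = (subsampled Hessian) @ D
    Xs = X[idx]
    return Xs.T @ (Xs @ D) / len(idx) + lam * D

def block_bfgs(H, D, Y):                      # block BFGS-type update enforcing H @ Y = D
    Delta = np.linalg.inv(D.T @ Y)
    E = np.eye(H.shape[0]) - D @ Delta @ Y.T
    return D @ Delta @ D.T + E @ H @ E.T

w_ref, H = np.zeros(d), np.eye(d)
eta, q, m = 0.2, 3, 50                        # step size, sketch size, inner-loop length (assumed)
for epoch in range(10):
    mu = full_grad(w_ref)                     # outer loop: full gradient at the reference point
    w = w_ref.copy()
    for t in range(m):                        # inner loop: preconditioned SVRG steps
        i = rng.integers(n)
        g = stoch_grad(w, i) - stoch_grad(w_ref, i) + mu
        w = w - eta * H @ g
        if t % 10 == 0:                       # occasionally refresh the metric H
            D = rng.standard_normal((d, q))   # Gaussian sketch of q directions
            Y = hess_action(D, rng.integers(n, size=32))
            H = block_bfgs(H, D, Y)
    w_ref = w
    print(epoch, np.linalg.norm(full_grad(w_ref)))

In this sketch, Delta = (D^T Y)^{-1} and the update is chosen so that the new H satisfies the block secant condition H Y = D; because Y is the action of a positive definite subsampled Hessian on a full-column-rank sketch D, the matrix D^T Y is invertible and the update keeps H positive definite.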