Paper ID: 832
Title: SDNA: Stochastic Dual Newton Ascent for Empirical Risk Minimization

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents several stochastic Newton methods for convex optimization. Given a convex function f, the authors consider several schemes for subsampling coordinates and performing Newton steps on the sampled subset of coordinates. The paper provides convergence rate bounds for each of these methods, bounds the relative quality of these guarantees, and shows how they improve upon previously known guarantees for non-accelerated coordinate-descent-like methods for convex optimization. The paper also demonstrates how these techniques extend in a fairly natural way to composite function minimization and empirical risk minimization, and provides empirical evidence for the methods' efficacy.

Clarity - Justification:
This paper is very well written and easy to follow. The notation is straightforward, the exposition is clear, and the results are well stated. Overall, the paper is enjoyable to read and written in a way where the ideas and the analysis come across naturally.

Significance - Justification:
This paper considers a natural family of stochastic Newton methods, cleanly analyzes the convergence rates of these algorithms, and provides a clear comparison of these bounds to standard stochastic linearly convergent algorithms for convex optimization. Ultimately, the authors demonstrate how, theoretically, their algorithms may yield running-time improvements for several natural and commonly considered optimization problems, and they provide empirical evidence that the algorithms do yield such improvements. While the stochastic Newton-based algorithms they present are natural and the analysis is not necessarily surprising, their results and the comparison between the algorithms and previous work are so clean that the paper sheds light on the efficacy of stochastic Newton methods and may, I hope, serve as a stepping stone for further work in the area.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Below are several minor comments for consideration:
- The phrase that SDNA uses "all local curvature information" is slightly misleading. The methods proposed consider second-order approximations of the functions in question and no more. Thus the sense in which SDNA uses "all" curvature information is unclear. This phrase is used in more than one place.
- Line 201: isn't the fact that M \preceq G implied by inequalities (3) and (4) simply by the definition of \preceq? This assumption is mentioned in more than one place.
- Line 262: the parenthetical comment seems a bit mysterious; I would be a little clearer here.
- Line 411: why is the word "informally" used? Why isn't that sentence a formal definition?
- Line 542: is this the true rate or just the best known analysis?
- Line 717: how does the proposed algorithm compare to Pilanci & Wainwright?

=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes a second-order version of minibatch SDCA: instead of optimizing the dual variables in the minibatch independently, they are optimized jointly by inverting the Hessian restricted to the minibatch variables. This is shown theoretically and empirically to be faster than standard minibatch SDCA.

Clarity - Justification:
The paper's quality is below average.
The notation is mildly confusing and the paper is somewhat dense reading. The paper also makes some inferences from the theorems that are not strictly correct (more below). The paper sometimes goes overboard in trying to justify its claims. As far as I can tell, the proposed variant of minibatch SDCA is very similar to the Accelerated Minibatch SDCA presented in previous work, but without acceleration.

Significance - Justification:
The paper's method seems interesting; however, there are drawbacks. First, the method is very similar to other recently proposed second-order methods, as is pointed out in the text. Second, the experimental results are unsatisfying, as they are on very small datasets (both in terms of number of examples and in terms of dimensionality), which might over-favor second-order methods. Finally, some of the conclusions derived from the theorems are not strictly correct. Not comparing against accelerated minibatch SDCA is a flaw as well, as it is not clear whether the proposed methods are better.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper could use better notation. For example, one could define M_S as A_S^T M A_S, where A_S is the n x |S| matrix whose columns are the standard basis vectors indexed by the elements of S; then there is no need for the paper's unusual definition of the inverse (a short sketch of this restriction appears after the reviews).

In Section 3, below Theorem 6, the paper draws the misleading conclusion that because the upper bounds on the convergence rates specified by sigma_1, sigma_2, and sigma_3 are ordered, the rates themselves must be ordered; this does not follow. For that to be the case, these bounds would have to be tight, or lower bounds would have to be proven somewhere as well. Line 686 makes the same category of mistake: it says that the theorems show one rate is always better than the other, when they merely show that a bound on one rate is always better than a bound on the other.

=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes a new dual method for optimizing regularized empirical loss problems that makes use of the local curvature information present. Theoretical justification for the methods is provided, along with empirical results on some small datasets.

Clarity - Justification:
The paper reads well and is organized in a manner that is easy to follow. The three stochastic algorithms are summarized well at the beginning of Section 2.1, which makes it easier to get the high-level idea quickly.

Significance - Justification:
Efficient optimization methods that go beyond first-order methods and make use of additional auxiliary information, such as curvature information, have great potential due to the growing scale of data. The paper seems to be a good step in this direction.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Since Method 2 is impractical in most situations, Method 1 and Method 3 are the key contributions of the paper, along with the generalized framework presented in equation (5). Proofs of linear convergence are provided for all three methods; however, the cost of each iteration is, in my opinion, an important factor for practical applications, so the paper should discuss this in more detail alongside each method.

As the paper mentions (and Figure 3 in the experiments shows), SDNA seems to do better with increasing batch sizes up to a certain limit, after which its performance decreases. The reasoning behind this behavior is not clear to me.
It would be good to see some precise discussion around this. Does sparsity of the data play a role? If so, I am not sure how robust the algorithms are, since tuning the batch size will be critical for practical applications and does not seem easy given this non-monotonic behavior (as seen in Figure 3).

To summarize, the paper makes some interesting observations about deriving a stochastic dual method that utilizes curvature information. It makes some good connections to the Iterative Hessian Sketch algorithm. However, it does not appear that the paper advances the area by a large margin in terms of novelty and new contributions.

=====
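Illustrative sketch referenced in Review #2's detailed comments. It shows (i) the restriction M_S = A_S^T M A_S via a selection matrix A_S, as suggested there, and (ii) the joint update of a sampled block of coordinates by solving the restricted Newton system, contrasted with updating each sampled coordinate independently, as described in Review #2's summary. The quadratic objective, variable names, sampling scheme, and iteration counts below are illustrative assumptions only, not the paper's actual formulation or algorithm.

import numpy as np

# Sketch of the block-restricted update discussed in Review #2, on a generic
# strongly convex quadratic f(x) = 0.5 * x^T M x - b^T x. Everything here is
# an illustrative assumption, not the SDNA algorithm from the paper.

def selection_matrix(n, S):
    """A_S: the n x |S| matrix whose columns are the standard basis vectors e_i, i in S."""
    A = np.zeros((n, len(S)))
    for col, i in enumerate(S):
        A[i, col] = 1.0
    return A

def joint_block_step(x, M, b, S):
    """Update the coordinates in S jointly by solving the |S| x |S| restricted Newton system."""
    A = selection_matrix(len(x), S)
    M_S = A.T @ M @ A               # restriction M_S = A_S^T M A_S
    g_S = A.T @ (M @ x - b)         # gradient restricted to the sampled coordinates
    x = x.copy()
    x[S] -= np.linalg.solve(M_S, g_S)
    return x

def independent_step(x, M, b, S):
    """Contrast: update each sampled coordinate on its own, using only the diagonal of M
    (the independent minibatch-style update; practical methods scale this step conservatively)."""
    x = x.copy()
    g = M @ x - b
    for i in S:
        x[i] -= g[i] / M[i, i]
    return x

# Toy run of the joint block update on a random positive definite quadratic.
rng = np.random.default_rng(0)
n, tau = 10, 3
B = rng.standard_normal((n, n))
M = B @ B.T + np.eye(n)             # positive definite
b = rng.standard_normal(n)
x = np.zeros(n)
for _ in range(100):
    S = rng.choice(n, size=tau, replace=False)
    x = joint_block_step(x, M, b, S)
print(np.linalg.norm(M @ x - b))    # residual should be small after enough steps

The joint step uses the off-diagonal curvature within the sampled block, which the independent per-coordinate update ignores; this is, informally, the extra curvature information at issue in the reviews' discussion.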