Paper ID: 507 Title: Even Faster Accelerated Coordinate Descent Using Non-Uniform Sampling

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): The paper provides a new variant of accelerated coordinate descent in which coordinates are sampled in proportion to the square roots of their smoothness parameters. The accompanying analysis yields improved runtime guarantees (a factor of up to sqrt(n)), with corresponding improvements for linear system solving and ERM.

Clarity - Justification: The paper reads clearly. Regarding the sqrt(n) speedup factor, it would help to describe (in words) the cases in which the speedup is actually achieved. Better still would be a concrete example of a problem that exhibits the corresponding parameter regime, and where it might arise in practice.

Significance - Justification: Contributions include: an improved guarantee of up to a sqrt(n) factor; a simplification of Lee and Sidford's sampling scheme; implied faster guarantees for linear system solving and ERM.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): I suspect that this paper will be straightforward to read and understand for someone already knowledgeable about the technical aspects of optimization for convex ERM (i.e., up to speed on the relevant papers written on the topic in the past few years). However, my impression is that it could be improved for a slightly broader audience -- one suggestion is to instantiate the algorithm for the linear-system case (perhaps only in the supplemental material) and fix as many of the parameters as possible there. In the experiments section, it could be made clearer whether the datasets actually exhibit properties that exercise the algorithm's advantage.
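To make the sampling scheme concrete for that broader audience, here is a minimal sketch of plain (non-accelerated) coordinate descent with probabilities proportional to L_i^{1/2}, as in the summary above. This is an illustration only, not the authors' NU_ACDM: the acceleration machinery is omitted, and all function and variable names are hypothetical.

```python
import numpy as np

def nonuniform_cd_sketch(L, num_iters, grad, x0, alpha=0.5):
    """Illustrative sketch: coordinate descent sampling coordinate i with
    probability proportional to L_i**alpha (alpha = 1/2 mirrors the paper's
    square-root scheme; alpha = 1 mirrors L_i-proportional sampling).
    `grad(x, i)` returns the i-th partial derivative."""
    L = np.asarray(L, dtype=float)
    p = L**alpha / np.sum(L**alpha)          # non-uniform sampling distribution
    x = np.array(x0, dtype=float)
    rng = np.random.default_rng(0)
    for _ in range(num_iters):
        i = rng.choice(len(L), p=p)          # sample a coordinate non-uniformly
        x[i] -= grad(x, i) / L[i]            # standard 1/L_i coordinate step
    return x
```

For instance, on the separable quadratic f(x) = 0.5 * sum_i L_i * x_i^2 (so grad(x, i) = L[i] * x[i]), each sampled coordinate is driven exactly to its minimizer in one step, and the non-uniform distribution simply controls how often each coordinate is visited.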
Where Table 2 states the "theoretical speedup" by computing the corresponding ratio of the bounds for NU_ACDM vs. ACDM, it seems natural to also report a corresponding empirical speedup factor (numerically, as opposed to reading it off the plots). There is a typo on line 10 of the Algorithm 1 box.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): This paper proposes a novel non-uniform sampling scheme for accelerated coordinate descent. The theoretical analysis shows that the proposed method improves the best known running time. The theoretical results are supported by extensive numerical experiments.

Clarity - Justification: The paper is extremely well-written, and the reader can go through all the details with no difficulty. The overview and background are complete and up-to-date.

Significance - Justification: The paper proposes a novel non-uniform coordinate descent approach applicable to a wide variety of applications. The method is supported by both theoretical results and extensive experimental simulations.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): I do not have any major comments on the paper. It is well-written and well-motivated. The theoretical results are interesting, and the experimental results are complete and extensive.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): The paper extends existing analyses of non-uniform coordinate gradient descent to accelerated descent methods.

Clarity - Justification: The proof included is not clear and is highly abridged; many of the details appear only in the supplemental material. There are many references to important parts of the paper (Algorithm 2, the experimental setup, additional applications) that appear only in the supplemental material, and this hinders its clarity.
It is standard to move secondary details of a paper (usually complete proofs or derivations) to the supplementary content, but in this case the authors have left one of the algorithms they introduce, and part of the discussion of their simulations, out of the main manuscript; this makes it difficult to assess the paper's suitability fairly relative to the other papers under review. My review therefore ignores the supplemental material in its assessment.

Significance - Justification: The results seem intuitive and unsurprising (randomization improves convergence, accelerated descent converges faster, so their combination should do even better). The order of the speedup and the novelty of the proof technique may be of interest.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): A suggestion for improving the clarity of the manuscript is to reorganize which portions belong in the main paper and which in the supplemental material. For example, moving the entirety of the proof of Theorem 5.1 to the appendix may allow some of the appendix content to be moved back into the manuscript. Section 6 refers to Algorithm 2, which is not included in the paper; the same holds for Figure 7 in Section 7. It is not clear to me whether the sqrt(n) speedup described in footnote 3 can be reached: if the probabilities are proportional to L_i and only one L_i is nonzero, then only one of the dimensions will ever be sampled, and the two approaches will match each other. Furthermore, the combination of Lemmas 5.2 and 5.3 into Lemma 5.4 is unclear, as is the wrap-up of the proof. I concede that the paper may not be in my area of expertise, but it was difficult to obtain a clearer understanding from the terse version of the proof in the main paper. Section 1.2 describes "RCDM with [different probabilities]".
My understanding from this paper is that the different algorithms are distinguished from one another by the choice of sampling probabilities over the dimensions, so it is not clear what this phrase means; if there are additional differences between the algorithms, it would be good to mention them in the manuscript. Some of the notation introduced in Section 4 is used earlier, so it may make sense to place that section earlier in the paper.

Minor comments:
* A term x_{k+1} appears to be missing from line 468
* In Theorem 5.1, there is a ^T on a scalar quantity, which doesn't make sense (or does it mean something other than "transpose"?)

=====