First of all, we would like to thank to reviewers for their constructive and in-depth comments. We answer individual comments after clarifying two general questions:$
[Motivation for specific analysis for coordinate descent] Indeed the general theorems from the first part of the paper can also be used to obtain primal-dual rates for coordinate descent (CD). However, we still give a separate analysis in Section 6 since these rates are stronger (e.g. eps vs eps^2 dependence), more explicit in terms of the dependence of the constants on the dimensionality, and cover a problem class which we believe is very important in machine learning (partial separability).
[Smoothness assumption on f] We agree that it would be cleaner to state Assumption 1 earlier and add the smoothness assumption on f to every Theorem, we will fix that.
*****
Reviewer #1:
We appreciate the positive assessment and spotting the typos, which we are happy to fix.
*****
Reviewer #2:
Thanks for the thorough review and the many helpful remarks!
[Bounded Support in Lemma 1] Yes, in Section 4.1 we forgot putting the explicit assumption that $g$ has to be either strongly convex ($\mu>0$) or has to have a bounded support, in Lemma 1. Indeed this was the main motivation to introduce the Lipschitzing trick. As you pointed out correctly, when using Lemma 1’ we only consider this valid case, so the proof remains correct. We will fix this by putting it explicitly into assumptions of Lemma 1 (as well as Lemma 14), and thanks again for spotting this.
[..The computation of the duality gap could require to solve a potentially expensive problem.]
This is a valid point especially from the practical impact for our approach, which we will address. The computation of the duality gap requires the evaluation of the primal as well as the dual objective. While maintaining $A \alpha$ is easy, the main computational part consists in computing $A^T w$, which needs one pass through the data. In practice when using cheap incremental algorithms, it is recommended to only compute the gap every few iterations.
We will try to improve the presentation of the entire Section 5.1 as by your comment, more clearly distinguishing the modification of the dual from the effect on the primal.
[Theorem 6 is a bit vague]
It is important to note that the algorithms’ iterate sequence as well as the original (primal) objective evaluated at the iterates are not affected by the modification (Lipschitzing trick). In this light, Theorem 3 and 4 give valid primal-dual convergence results for the unmodified primal problem.
It is correct, as you mentioned, that the dual objective is modified and so is the duality gap (the original dual might even have been undefined for many iterates). In any case, the ‘new’ duality gap serves as a certificate of optimality for the original problem, in the region of interest.
[Corollary 7 is not meaningful. No control over Bt]
Indeed the goal here is not to control B_t, but Corollary 7 assumes some (potentially loose) bounds B_t given. This is trivial for norm-regularized problems. We simply state that whenever the level sets are bounded and some B_t is given, we have a B_t-bounded set containing all iterates and therefore can use Theorem 6 to get primal dual convergence rates for this algorithm. We will try to improve presentation of the idea.
[Elastic Net]
Thank you for the suggestion, but we intentionally put the elastic net in Section 5 since we first address it by using smoothness of the data-fit term, not the regularizer itself. The latter would indeed be simpler, but will not work well for small values of the L2 part. Our approach further allows existing primal methods for elastic net to be reused directly.
[Specific analysis for coordinate descent]
See above.
We thank the reviewer for all other comments and are happy to address them in a revised version.
(We also agree with the comment on the domain of g^*, see our first answer w.r.t. Lemma 1.)
*****
Reviewer #3:
[The tests show convergence of the duality gap, but it would be nice to show duality gap and true error in the objective, so we can see if they converge at the same rate.]
This a good comment, it indeed would be nice to perform additional experiments compare the convergence of the duality gap with the objective in practice. If space permits, we’ll be happy to include the plots of the objective in the paper as well.
[Table 1]
Thanks for the suggestion, we will simplify the loss part by only using one column.
[Equations (1a)(1b)(2a)(2b)]
We present all four optimality conditions to highlight and maintain the full symmetry of the primal and dual problems (A) and (B). But we agree that (1b), (2b) follow from (1a), (2a) and will try to simplify presentation and references for this.
[Footnote 3, line 273;]
Thank you for this comment, it is true, we need a $sup$ here as you pointed out.