Paper ID: 423
Title: Stochastic Variance Reduced Optimization for Nonconvex Sparse Learning

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper considers the problem of taking an SGD approach to problems with sparsity constraints. It borrows ideas from hard thresholding to enforce the sparsity constraint; then, to obtain fast convergence rates, it takes ideas from recent variance reduction techniques in SGD, whereby the stochastic gradient step is adjusted using a full gradient computed at an old point. Thus the paper can be seen as providing the underlying analysis and machinery for SVRG and GHT to work together.

Clarity - Justification:
The paper is easy to read, and explains well both the challenges and the relationship to prior work.

Significance - Justification:
While the results are not entirely unexpected, they seem important and potentially useful.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
This paper considers the setting of minimizing a loss function subject to a cardinality (sparsity) constraint on the decision variables. The loss function, as is typical, has an additive decomposition over the data. Hence, a natural idea is to combine SGD techniques with hard thresholding. Individually, both have enjoyed a considerable amount of attention. SGD has the benefit that each iteration requires access to only a single piece of the data, as opposed to the computation of the full gradient, which requires a full pass over the data. Hard thresholding is a nonconvex projection step, which keeps only the k largest nonzero elements of the vector at each stage. For problems of appropriate regularity (restricted strong convexity, RIP, or a similar regularity condition), hard thresholding has been shown to achieve linear convergence. Back on the SGD side, recent work on variance reduction (SVRG) shows that if the full function is strongly convex (and each individual component function is smooth), then re-centering each stochastic gradient step by the full gradient evaluated at an old iterate yields linear convergence, hence dramatically improving on the convergence rate of standard SGD, which is forced to use a very small step size to account for the potentially large (specifically: non-decreasing) variance of the stochastic gradients.

This paper combines ideas from SVRG and GHT. At some level, the results are not surprising: it is natural to expect that these directions could be combined. But the results seem useful. Also, the analysis demonstrates that there are some non-trivial technical hurdles, particularly in making the SVRG-GHT combination useful without unnecessarily restrictive assumptions on the data.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
Stochastic gradient descent, and its recent variance-reduced version SVRG, has been a quite popular, provably convergent algorithm (for convex functions) in machine learning. This work extends SVRG to a certain family of nonconvex functions under the nonconvex cardinality constraint. The main contribution is to re-derive the linear convergence rate (with constant per-step complexity) under restricted convexity and smoothness. This work appears to be a direct and useful extension of SVRG, and the authors have sufficiently demonstrated the effectiveness of their approach through formal proofs and empirical experiments.
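For concreteness, the update the reviews describe (an SVRG-style variance-reduced stochastic gradient step followed by hard thresholding onto the k-sparse set) might look like the following minimal sketch. The least-squares loss, the step size, and all names here are illustrative assumptions for exposition, not taken from the paper.

```python
import numpy as np

def hard_threshold(w, k):
    """Keep the k largest-magnitude entries of w and zero out the rest."""
    out = np.zeros_like(w)
    idx = np.argpartition(np.abs(w), -k)[-k:]
    out[idx] = w[idx]
    return out

def svrg_ht(X, y, k, eta=0.005, n_epochs=20, m=None, seed=0):
    """SVRG-style updates with a hard-thresholding projection, illustrated on
    the least-squares loss f_i(w) = 0.5 * (x_i @ w - y_i)**2 (an assumption)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = m if m is not None else n            # inner-loop length: one pass by default
    w_snap = np.zeros(d)                     # snapshot ("old") point
    for _ in range(n_epochs):
        full_grad = X.T @ (X @ w_snap - y) / n   # full gradient at the snapshot
        w = w_snap.copy()
        for _ in range(m):
            i = rng.integers(n)
            # variance-reduced stochastic gradient:
            #   grad f_i(w) - grad f_i(w_snap) + full_grad
            g = X[i] * (X[i] @ w - y[i]) - X[i] * (X[i] @ w_snap - y[i]) + full_grad
            # gradient step followed by the nonconvex projection (hard thresholding)
            w = hard_threshold(w - eta * g, k)
        w_snap = w
    return w_snap
```

In this sketch the only change relative to plain SVRG is the hard_threshold call replacing the identity (or a convex projection), which is exactly the GHT ingredient the paper's analysis has to accommodate.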
Clarity - Justification:
The presentation is easy to follow, partly because the idea is quite natural. One thing that could be improved is the statement of the theorems: right now they are too long and there is no indication of where they end. It would also be good to discuss some of the conditions, for instance how stringent is the condition on alpha in Theorem 3.3? Also, there are two sources of nonconvexity: one from the objective and the other from the cardinality constraint. For the objective, it is clear that restricted convexity essentially reduces it to the convex case. For the cardinality constraint, what makes it tractable to deal with here? Some discussion of how the proofs handle the two sources of nonconvexity would be good.

Significance - Justification:
The main results appear to be a solid extension of the variance reduction technique to a restricted family of nonconvex functions. Its practical relevance is clear, and the authors have done a nice job demonstrating its superiority over the convex approach in the experiments. One thing that would have been nice is to derive Theorems 3.3 and 3.5 for the ell_1 norm, so that we can make clear comparisons to put things into perspective.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
In Theorem 3.3, it is not clear what the true model parameter w^* is. Is it the global minimizer? Is it unique? (From the remaining part of the theorem, it appears so; maybe this follows from restricted strong convexity?)

In line 264, the authors mention the NP-hardness of the cardinality constraint. It would be good to comment on what and how this NP-hardness is avoided in this work. (Perhaps by restricted strong convexity?)

The big-O notation in Eq. (3.9) and some similar ones afterwards are not precise: taking sigma = 0, does one recover w^* exactly? I think the (additive) dependence on epsilon is missing here.

Line 452: where does the expectation come from?

The proof of Lemma 4.3 may not be correct: in inequality (B.2), the support of the gradient nabla phi_i(w) may be bigger than s, which would invalidate the RSS condition. Perhaps one should only take the gradient over the set Ical? Similarly for line 1221, where the optimality of w^* is claimed.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
In this paper, the authors propose a stochastic variance reduced optimization algorithm for solving a class of nonconvex optimization problems with cardinality constraints. They provide sufficient conditions under which the estimation error decreases exponentially until a certain precision. They further develop an asynchronous variant of the approach. Numerical experiments demonstrate the efficiency of the proposed method.

Clarity - Justification:
Please refer to "Detailed comments".

Significance - Justification:
Algorithm 1 is a straightforward application of SVRG to gradient hard thresholding (GHT). So, the contribution of this paper is the theoretical analysis of Algorithm 1. However, there are several points that are unclear to me, which can be found in "Detailed comments".

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
About Theorem 3.3:
1. In the beginning of Theorem 3.3, the authors write "Let $w^*$ be a sparse vector of the true model parameter such that $\|w^*\|_0 \leq k^*$". Up to Line 233, we only have a nonconvex optimization problem in (1.1). So, what is the meaning of the true model?
If $w^*$ is not the optimal solution to (1.1), the authors should emphasize that the optimization error in (3.5) is NOT defined with respect to the optimal solution.
2. The value of $k$ depends on the square of the condition number $\kappa_s$. So, the result is meaningful only when $\kappa_s$ is small. In other words, if we require $k$ to be of the same order as $k^*$, then $\kappa_s$ becomes a constant. Thus, we cannot claim that the assumption in this paper is weaker than that in (Nguyen et al., 2014). As a result, the advantage of this work over (Nguyen et al., 2014) is unclear to me.
3. There is an inequality in Line 240. Why is it true?
4. There are two constant terms, namely $g_1(w^*)$ and $g_2(w^*)$, on the right-hand sides of (3.3) and (3.4), respectively. So, we can only claim that the optimization error and the estimation error decrease exponentially "until a certain precision" (see the schematic recursion sketched after these reviews). Due to the presence of $g_2(w^*)$, the estimation error does not converge to 0, which means the nonconvex formulation in (1.1) also suffers from estimation bias.
5. The orders of $g_1(w^*)$ and $g_2(w^*)$ are unclear.
6. There is no theoretical guarantee about support recovery. Can the algorithm recover the support of $w^*$?

About the "Statistical Theory": Are the results in Theorems 3.5 and 3.8 tighter than the bounds derived from convex formulations? For example, is (3.9) tighter than the bound for the lasso?

=====
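To illustrate the "decreases exponentially until a certain precision" behavior questioned in point 4 of Review #3: the generic argument is a per-stage contraction with an additive constant. The recursion below is an assumed schematic form for exposition, not the paper's exact bound.

```latex
% Assumed schematic recursion (not the paper's exact statement):
% a per-stage contraction with an additive term g_1(w^*).
\[
  \mathbb{E}\bigl[F(\tilde w^{\,t+1}) - F(w^*)\bigr]
    \;\le\; \rho\,\mathbb{E}\bigl[F(\tilde w^{\,t}) - F(w^*)\bigr] + g_1(w^*),
  \qquad 0 < \rho < 1.
\]
% Unrolling over T stages gives
\[
  \mathbb{E}\bigl[F(\tilde w^{\,T}) - F(w^*)\bigr]
    \;\le\; \rho^{T}\bigl(F(\tilde w^{\,0}) - F(w^*)\bigr)
    + \frac{g_1(w^*)}{1-\rho},
\]
% so the error contracts geometrically only down to a floor of order
% g_1(w^*)/(1 - rho); this is why the orders of g_1(w^*) and g_2(w^*)
% (points 4 and 5 above) determine the final statistical precision.
```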