We thank the reviewers for their comments.

Reviewers 1 & 4: w* is not the optimal solution to (1.1) but the unique sparse parameter/coefficient that generates the observations from the underlying distribution, e.g., the sparse linear model with sub-Gaussian noise. Our goal is to obtain the estimation error w.r.t. w* (not the optimization error w.r.t. the global optimum \hat{w}), which includes a computation error and a bias term; thus we present it in Thm. 3.3 as a computational theory. Our estimation error w.r.t. w* is of the same order as the estimation error of the global optimum \hat{w} w.r.t. w*. In the noiseless case, w* is also the global optimal solution. We'll further clarify this in the revision.

Reviewer 1: "Detailed comments": 2. (Nguyen et al., 2014) cannot handle large κ_s even with large k (as shown in Sec. 5.1). (Nguyen et al., 2014) require κ_s ≤ 4/3, which is too restrictive and can easily be violated, e.g., when there are correlations in the design matrix. The advantage of our result is that we allow a large κ_s, e.g., 100. Admittedly, k may not be of the same order as k^* for large κ_s, but our guarantees still hold. 3. As shown in the proof of Thm. 3.3 in the Appendix (page 5), when we choose proper k, m, and η, line 240 is guaranteed (the shrinkage constant is ≤ 3/4). 4. The bias is inevitable. Though we attain the same order of error bound as l1 methods, in practice the bias is much smaller than that of l1 methods (shown in Tab. 3), with better iteration complexity. 5. We provide the bound on g2(w*) in Thms. 3.5 and 3.8. Taking sparse linear models as an example, g2(w*) in Thm. 3.5 attains the optimal rate O(σ\sqrt{k^* \log d / (nb)}) for sparse models. The bound on g1(w*) is not provided since we do not need it, but it can be obtained from the bound on g2(w*). 6. In the noiseless case, support recovery is guaranteed since g1(w*) and g2(w*) are zero, and we have exact parameter recovery (also shown in Tables 1 and 3). In the noisy case, under the condition that the nonzero entries of the sparse parameter are strong enough relative to the noise, we can guarantee that there is no under-selection of the true support. If, in addition, no over-selection of the true support is wanted, an extra truncation procedure is needed. We will elaborate on this point in the revision. About the "Statistical Theory": our results in Thms. 3.5 and 3.8 attain the same order of bound as the Lasso, which is optimal for sparse models (Loh & Wainwright, 2013). Also note that although our algorithm is stochastic, the statistical rates we obtain are the same as those of the state-of-the-art nonconvex batch algorithms (Jain et al., 2014; Loh & Wainwright, 2013).

Reviewer 4: "Clarity": We'll add a discussion to Thm. 3.3. The cardinality constraint is partly handled by the RSC/RSS properties. Besides, it is also the source of the bias in the parameter and objective estimates (g1(w*) and g2(w*) in Thm. 3.3). "Significance": the l1 norm cannot guarantee sparsity within SVRG due to its truncation procedure (soft thresholding); thus it cannot guarantee that RSC/RSS hold and only achieves sublinear convergence. We will remark on this point in the revision. "Detailed comments": 1. You're right: NP-hardness is for the worst-case scenario without further structural conditions on the model. The RSC/RSS model assumptions allow us to rule out the hard instances. 2. It is true that w* can be recovered exactly in the noiseless case (σ = 0), as shown in Tabs. 1 & 3. For Eq. (3.9), we consider ε << O(σ\sqrt{k^* \log d / (nb)}), which is then absorbed into the big-O term. We'll clarify this.
3. The expectation is meant to convey that when we take a random small subset of the coordinates of w, its l2 norm is, on average, much smaller than the l2 norm of w. This is well satisfied by sparse w and eventually yields the near-linear speedup of the asynchronous procedure. This setting follows (Reddi et al., 2015). 4. Good catch: the optimality of w* in line 1221 is correct, but there is a typo: we should have ∇_I ϕ_i(w*) in Eq. (B.2), i.e., restrict the perturbation by the gradient to the set \mathcal{I}. Then in Eq. (B.3) we do not need the first two inequalities but obtain the inequality to the last line directly. The rest of the proof then follows. We will update accordingly.

Reviewer 5: We thank the reviewer for the valuable comments. We consider this work an important extension of SVRG that handles a class of nonconvex problems and achieves the same linear convergence without imposing more stringent conditions. Empirical results on synthetic as well as real data indicate that this is a useful algorithm with an easily tuned parameter k.
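For concreteness, below is a minimal NumPy sketch of the kind of SVRG-with-hard-thresholding update discussed above, run on a noiseless sparse linear model where (near-)exact recovery of w* is expected. It is illustrative only and not our implementation: the least-squares loss, the top-k projection via argpartition, and all constants (k, m, η, batch size, sample sizes) are assumptions chosen for the example.

import numpy as np

def hard_threshold(w, k):
    """Keep the k largest-magnitude entries of w and zero out the rest."""
    out = np.zeros_like(w)
    keep = np.argpartition(np.abs(w), -k)[-k:]
    out[keep] = w[keep]
    return out

def svrg_ht(X, y, k, eta=0.02, m=50, b=50, n_epochs=60, seed=0):
    """SVRG-style updates with a top-k hard-thresholding projection,
    applied to the least-squares loss f(w) = ||Xw - y||^2 / (2n).
    All parameter values here are illustrative, not tuned recommendations."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w_snap = np.zeros(d)                        # snapshot point \tilde{w}
    for _ in range(n_epochs):
        full_grad = X.T @ (X @ w_snap - y) / n  # full gradient at the snapshot
        w = w_snap.copy()
        for _ in range(m):
            B = rng.choice(n, size=b, replace=False)
            g_w = X[B].T @ (X[B] @ w - y[B]) / b
            g_snap = X[B].T @ (X[B] @ w_snap - y[B]) / b
            v = g_w - g_snap + full_grad        # variance-reduced gradient
            w = hard_threshold(w - eta * v, k)  # gradient step + projection
        w_snap = w
    return w_snap

# Noiseless sparse linear model (sigma = 0): w* should be recovered (almost) exactly.
rng = np.random.default_rng(1)
n, d, k_star = 1000, 500, 10
X = rng.standard_normal((n, d))
w_star = np.zeros(d)
support = rng.choice(d, size=k_star, replace=False)
w_star[support] = rng.choice([-1.0, 1.0], size=k_star)
y = X @ w_star
w_hat = svrg_ht(X, y, k=2 * k_star)
print("||w_hat - w*||_2 =", np.linalg.norm(w_hat - w_star))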