We thank the reviewers for their valuable comments. We will revise the paper accordingly in the final version.

Reviewer_2: In the final version, we will specify in the abstract that $f$ is known, present the assumptions on $f$ in the introduction, and add more real-data results to the numerical study. (1) Assumption 2: Yes, Assumption 2 can be weakened to the “high probability” version, and we can prove the corresponding strengthened version of Theorem 1 by adding a conditioning argument. The parameter $D$ in Assumption 2 depends on the max-norm of the population covariance matrix $\Sigma = E[XX^T]$, which in turn depends on the choice of the random measurement vectors. A concentration inequality shows that the sample covariance $\hat\Sigma$ converges to $\Sigma$ in max-norm, which suggests setting $D = 2\|\Sigma\|_{\infty}$. We will add these two remarks with more details in the final version. (2) Constant $B$: As shown in (D.1) in the proof of Theorem 1, conditioning on the design matrix, each element of $\nabla L(\beta^*)$ is an average of independent sub-Gaussian random variables whose $\psi_1$-norm is bounded by $b\sqrt{D/n}$, where $b$ is the upper bound of $f'$. Taking a union bound and applying the tail probability bound for sub-Gaussian random variables yields the bound on the max-norm of $\nabla L(\beta^*)$; thus we can bound $B$ by $b\sigma\sqrt{D}$. We will make this dependence clearer in the final version. (3) Dependence on $a$: The upper bound on the estimation error is proportional to $a^{-2}$. As shown in the appendix, the minimax lower bound is proportional to $b^{-2}$, where $b$ is the upper bound of the derivative of $f$. Hence, if $a$ and $b$ are of the same order, our upper bound is optimal. (4) Constant in the tuning parameter: Our theory works if $\lambda = C \sigma \sqrt{s \log d/n}$, which is used to upper bound the noise; here $C$ is a sufficiently large constant. In the theoretical analysis, $C$ comes from an upper bound that is potentially loose in its constants, and the order of $\lambda$ is usually more important than the constant. Note that $\sigma$ is unknown in practice, so the constant in $\lambda$ always has to be chosen by tuning methods such as cross-validation; a minimal illustrative sketch is appended at the end of this response.

Reviewer_6: c. Our analysis does not treat $f$ as approximately linear. All we need is a lower bound on $f'$ that prevents $f$ from being entirely flat (which would make estimation impossible); aside from this condition, $f$ can be arbitrary and very different from a linear function. The dependence on $a$ is $a^{-2}$, which is essentially tight according to the lower bound in the appendix. d. We are motivated by the empirical success of the common practice in statistics and signal processing of directly applying the Lasso in the presence of nonlinearity (which leads to a nonconvex problem). Part of our motivation is also to understand feedforward neural networks from scratch. Following the reviewer's suggestions, we will add more details on the motivation and real-data examples in the final version.

Reviewer_7: a. Yes, the model looks similar to a GLM, but there are differences: the noise in (1.1) is additive, whereas in a GLM the randomness comes from an exponential family distribution. The two models have different applications and are therefore not directly comparable. b. We will add some background on the Dantzig selector in the final version. c. It is possible to slightly relax the sparse eigenvalue conditions. However, it is unlikely that we can handle ill-conditioned design matrices.
Note that this setting is difficult even for the linear regression model, and there are converse theorems (e.g., http://www.eecs.berkeley.edu/~wainwrig/Papers/RasWaiYu11.pdf, among others) suggesting that parameter recovery becomes impossible for ill-conditioned design matrices. d. We will add real-data examples in the final version.
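Appended sketch (for Reviewer_2 (4) and Reviewer_6 d): the following is a minimal illustration of our own, not the implementation used in the paper, of directly applying the $\ell_1$ penalty with a known nonlinear link $f$. The nonconvex objective is minimized by proximal gradient descent, and the constant in $\lambda$ is chosen by cross-validation because $\sigma$ is unknown. The specific link $f(t) = 2t + \sin t$ (so $a = 1$ and $b = 3$), the step size, the $\lambda$ grid, and all function names are illustrative assumptions only.

# Minimal sketch (illustration only): l1-penalized least squares with a known
# nonlinear link f, solved by proximal gradient descent, with lambda chosen by
# K-fold cross-validation since sigma is unknown in practice.
import numpy as np

def f(t):            # example link with 1 <= f'(t) <= 3, i.e. a = 1 and b = 3
    return 2.0 * t + np.sin(t)

def f_prime(t):
    return 2.0 + np.cos(t)

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def fit(X, y, lam, n_iter=500):
    """Proximal gradient on L(beta) = (1/2n)||y - f(X beta)||^2 + lam * ||beta||_1."""
    n, d = X.shape
    # crude step size based on b^2 * ||X||_2^2 / n (b = 3 for this example link)
    step = 1.0 / (3.0 ** 2 * np.linalg.norm(X, 2) ** 2 / n)
    beta = np.zeros(d)
    for _ in range(n_iter):
        r = f(X @ beta) - y                        # residuals
        grad = X.T @ (f_prime(X @ beta) * r) / n   # gradient of the smooth part
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

def cv_lambda(X, y, lam_grid, n_folds=5, seed=0):
    """Pick lambda by K-fold cross-validation on held-out prediction error."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    errors = []
    for lam in lam_grid:
        err = 0.0
        for k in range(n_folds):
            test = folds[k]
            train = np.setdiff1d(np.arange(n), test)
            beta = fit(X[train], y[train], lam)
            err += np.mean((y[test] - f(X[test] @ beta)) ** 2)
        errors.append(err / n_folds)
    return lam_grid[int(np.argmin(errors))]

# toy example: sparse beta*, Gaussian design, additive noise as in (1.1)
rng = np.random.default_rng(1)
n, d, s = 200, 500, 5
beta_star = np.zeros(d); beta_star[:s] = 1.0
X = rng.standard_normal((n, d))
y = f(X @ beta_star) + 0.5 * rng.standard_normal(n)

lam_grid = np.array([0.01, 0.05, 0.1, 0.2, 0.5])
lam_hat = cv_lambda(X, y, lam_grid)
beta_hat = fit(X, y, lam_hat)
print("chosen lambda:", lam_hat, "estimation error:", np.linalg.norm(beta_hat - beta_star))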