Paper ID: 1113
Title: Sparse Nonlinear Regression: Parameter Estimation under Nonconvexity

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
See below.

Clarity - Justification:
See below.

Significance - Justification:
See below.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper studies the problem of sparse regression with a non-linear link function. Under a standard model, the paper shows that any stationary point of the resulting least-squares optimization problem leads to a good estimate of the unknown parameters.

Comments:
a. The paper is, in general, well written.
b. The result is quite interesting, especially because, due to non-convexity, one can only hope to reach a stationary point.
c. Theorem 1 presents an error bound that scales as 1/a^2, where a is the lower bound on the derivative f'. This intuitively suggests that the analysis only performs a first-order Taylor expansion and then treats the problem essentially as a linear one. Is this correct, and what type of dependence on 1/a can be expected? (A rough heuristic is sketched after this review.)
d. The problem is motivated fairly poorly, and the experimental results are mostly on simulated data. A demonstration of the method's advantage over linear regression on real-world data would go a long way toward establishing the importance of the problem and the value of the algorithm.

The paper considers regression with a non-linear link function, leading to a non-convex problem. The authors show that any stationary point of the resulting optimization problem achieves appropriate error rates. The paper addresses a timely problem, and the result seems interesting. However, a little more investigation is needed to demonstrate the usefulness of the algorithm and the dependence of the error bounds on the various problem parameters.
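A rough heuristic for comment (c), assuming that the model (1.1) and the estimator (1.2) take the standard forms below (these exact forms are an assumption inferred from the reviews, not quoted from the paper):

$$ y_i = f(x_i^\top \beta^*) + \epsilon_i, \qquad \hat{\beta} \in \arg\min_{\beta} \; L(\beta) + \lambda \|\beta\|_1, \qquad L(\beta) = \frac{1}{2n} \sum_{i=1}^n \bigl( y_i - f(x_i^\top \beta) \bigr)^2 . $$

In expectation, the Hessian of the least-squares loss at $\beta^*$ is

$$ \nabla^2 L(\beta^*) \approx \frac{1}{n} \sum_{i=1}^n f'(x_i^\top \beta^*)^2 \, x_i x_i^\top \;\succeq\; a^2 \cdot \frac{1}{n} \sum_{i=1}^n x_i x_i^\top $$

(the term involving $\epsilon_i f''$ has mean zero), so the restricted-strong-convexity constant $\kappa$ is roughly $a^2$ times its linear-regression counterpart. Plugging $\kappa \asymp a^2$ into the standard $\ell_1$ bound $\|\hat{\beta} - \beta^*\|_2 \lesssim \lambda \sqrt{s^*} / \kappa$ (with $s^*$ the sparsity of $\beta^*$) reproduces the $1/a^2$ scaling; in this sense the analysis does behave like a local linearization of $f$, and whether a milder dependence on $1/a$ is achievable is not clear from this heuristic alone.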
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper considers sparse nonlinear regression models, with a known monotonic transformation applied to the uncontaminated linear measurements. For parameter estimation, under certain statistical assumptions, it is shown that any stationary point of the non-convex L1-regularized least-squares estimator converges to the true parameter at the optimal rate. For asymptotic inference, hypothesis tests and confidence intervals are constructed based on decorrelated score and Wald test statistics, for which asymptotic normality is shown. Overall the paper is of good quality in terms of writing and technical content. The only concern is that the monotonic transformation f may be unknown in practice except for some simple functions.

Clarity - Justification:
The paper is well organized, and the notation is clear and consistent. The differences between the problem setup considered in this paper and the existing work in the literature are clearly identified, but the purpose of using the nonlinear model (1.1) is a bit under-motivated. It is good to see that the key assumptions are emphasized and stated separately from the results. Generally, it is easy to understand the algorithm and the main results, although the technical proofs are dense and difficult to follow.

Significance - Justification:
The significance of this paper mainly comes from dealing with the nonconvexity. It is impressive that all stationary points of the non-convex estimator are equally good in terms of estimation error.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
a. The model (1.1) looks similar to a generalized linear model (GLM). It would be better to have some comparison between them. Does using (1.2) give any benefit over the corresponding maximum likelihood estimator for a GLM, say, logistic regression?
b. The introduction of the Dantzig selector (2.6) used in Section 2.3 is a bit abrupt. It might be better to give some background on the Dantzig selector, which may help readers understand the test.
c. In Section 3.1, in order to show the goodness of every stationary point, the sparse eigenvalue condition (3.1) is used, which is slightly weaker than the RIP. However, this condition can still fail on a badly conditioned design matrix. Is it possible to relax it further?
d. The experiments could be improved, especially on real data.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors consider the sparse nonlinear regression model in which the data are noisy observations of a known function of linear combinations of an unknown parameter to be estimated. For this purpose, a Lasso-type procedure is proposed. The known function is assumed to be nonlinear, which leads to a non-convex optimization problem. The authors prove that any stationary solution achieves the classical rates under the l1 and l2 losses. Testing procedures and confidence intervals are built; they enjoy optimal statistical properties in the asymptotic setting. A short numerical study illustrates the theoretical results.

Clarity - Justification:
The paper is well organized, and most of the results are clearly presented even though they are quite technical. Some improvements may be possible in Sections 2.3 and 2.4, which are very technical. For the sake of clarity, specify in the Abstract that the function $f$ is known. The assumptions on $f$ should also appear more clearly in the Introduction.

Significance - Justification:
The authors consider a very interesting and challenging problem. They obtain very nice results on a very competitive topic. They clearly show the optimality of the obtained rates under quite reasonable assumptions (in particular, $f$ is known and monotonic). Furthermore, their contribution on confidence sets and testing is very original, providing a very complete study of the problem. The main drawback of the paper lies in the weak numerical study, which does not really illustrate or reinforce the theoretical results (a minimal synthetic sketch of the setup appears after the reviews). The algorithms could also be discussed in more detail.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
To improve the paper, the authors should address the following points:
- Assumption 2 is quite strong, but in some cases it can be verified with high probability. Can the statement of Theorem 1 be formulated under weaker conditions, taking into account that Assumption 2 may hold only with probability close to 1? Note that the tuning parameter $\lambda$ depends on $D$, so the impact of the $D$ introduced in Assumption 2 is important.
- The tuning parameter also depends on $B$, which controls the gradient of $L$. Do we have an idea of the value of $B$?
- The rates depend on $a$, the lower bound on the derivative of $f$. The authors discuss the case where $a$ goes to 0, but do we have a clear idea of the dependence of the rates on $a$ (which does not appear in the lower bound stated in the supplementary material)?
- For practical purposes, the leading constant of $\lambda$ is set to 3. This is not discussed, and this value is far from the theoretical one. Some comments would be welcome.

=====
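For concreteness, below is a minimal synthetic sketch of the setup discussed in the reviews: a sparse nonlinear regression model with a known, strictly increasing link, estimated by proximal gradient descent on the non-convex $\ell_1$-penalized least squares until a stationary point is reached. This is not the authors' code or experiment; the link f, noise level, dimensions, step size, and the exact form of the estimator (assumed to match (1.1)-(1.2)) are illustrative assumptions, and the leading constant 3 in $\lambda$ simply follows the remark in Review #3.

import numpy as np

rng = np.random.default_rng(0)
n, d, s = 200, 500, 5

# Known strictly increasing link; its derivative lies in [0.5, 1.5], so a = 0.5 > 0.
f  = lambda u: u + 0.5 * np.sin(u)
fp = lambda u: 1.0 + 0.5 * np.cos(u)

# Sparse ground truth and Gaussian design.
beta_star = np.zeros(d)
beta_star[:s] = rng.choice([-1.0, 1.0], size=s) * (1.0 + rng.random(s))
X = rng.normal(size=(n, d))
sigma = 0.1
y = f(X @ beta_star) + sigma * rng.normal(size=n)

# Tuning parameter ~ sigma * sqrt(log d / n), leading constant 3 (illustrative choice).
lam = 3.0 * sigma * np.sqrt(np.log(d) / n)

def grad_loss(beta):
    # Gradient of L(beta) = (1/2n) * ||y - f(X beta)||^2.
    u = X @ beta
    return X.T @ ((f(u) - y) * fp(u)) / n

def prox_l1(v, t):
    # Soft-thresholding: proximal operator of t * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# Proximal gradient descent from a random, dense starting point; the claim under
# review is that any stationary point reached this way is a good estimate.
beta = rng.normal(size=d)
step = 0.02
for _ in range(3000):
    beta = prox_l1(beta - step * grad_loss(beta), step * lam)

print("l2 estimation error :", np.linalg.norm(beta - beta_star))
print("nonzeros in estimate:", np.count_nonzero(beta))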