Paper ID: 831
Title: No penalty no tears: Least squares in high-dimensional linear models

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes two estimation schemes, LAT and RAT, for estimating a sparse loading vector in linear regression. The method essentially consists of solving an OLS problem twice in a row, each time followed by thresholding, and then performing a final OLS estimation. The threshold can be determined using BIC. Under the assumptions that the design is Gaussian and that the noise has bounded variance, the paper establishes that the procedure recovers the support with high probability, and it provides rates of convergence in the L2 and L-infinity norms. Finally, a number of experiments show, among other things, that LAT and RAT achieve accuracy comparable to or better than the lasso combined with an OLS debiasing of the coefficients, and that in the presence of correlated predictors RAT performs better and is on par with the elastic net.

Clarity - Justification:
The paper is clearly written. A number of statements might deserve to be discussed or nuanced (see the detailed comments).

Significance - Justification:
The paper establishes interesting new theoretical results with an algorithm that is simpler than the algorithms usually used in the sparsity literature, and it does so under weak assumptions on the distribution of the noise.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I found the paper quite interesting. I have two main comments:

- On lines 66 and 67, the paper says "On the one hand, methods using convex penalties, such as lasso, usually require strong conditions for model selection consistency". I would agree with this statement: typically a condition like irrepresentability is needed. However, in recent results such as

  Raskutti, G., Wainwright, M. J., & Yu, B. (2010). Restricted eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research, 11, 2241-2259.

  Zhou, S. (2009). Restricted eigenvalue conditions on subgaussian random matrices. Technical report, Department of Mathematics, ETH Zurich, December 2009.

  these "strong conditions" have been shown to be implied by simpler assumptions on the covariance matrix of the design when the design is Gaussian, such as the one made in the paper, namely that the design is Gaussian and that the condition number of the covariance of the design is bounded. So it is not clear to me that the assumptions needed for lasso consistency are really stronger than the assumptions needed in the paper. On the contrary, the result proven in the paper seems to rely on the covariance of the design being diagonally dominant, which seems stronger than the irrepresentability condition. It would be nice if the paper discussed this. Does the proposed method really require weaker conditions than the lasso? What happens if we remove the Gaussian assumption on the design?

- What if the design is not Gaussian? The paper insists very heavily on the fact that the noise can be heavy-tailed, but is relatively discreet about the fact that the theorems only hold for a Gaussian design.
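To fix notation for the comments above and below, here is my reading of the three-step procedure as a rough sketch (Python/NumPy; my own reconstruction from the paper's description, not the authors' code; the thresholds and the small ridge parameter are placeholders that the paper would select via BIC):

    import numpy as np

    def three_step_estimate(X, y, t1, t2, r=1e-3):
        """Sketch of the LAT/RAT-style pipeline as I understand it.

        Step 1: ridge / minimum-norm least squares on the full p > n design
                (r = 0 would give the minimum-norm OLS variant), then hard
                thresholding at t1.
        Step 2: OLS on the retained columns, then thresholding at t2.
        Step 3: final OLS on the surviving columns.
        """
        n, p = X.shape
        # Step 1: beta = X^T (X X^T + r I)^{-1} y
        beta1 = X.T @ np.linalg.solve(X @ X.T + r * np.eye(n), y)
        S1 = np.flatnonzero(np.abs(beta1) > t1)

        # Step 2: OLS restricted to S1, followed by a second thresholding
        beta2 = np.linalg.lstsq(X[:, S1], y, rcond=None)[0]
        S2 = S1[np.abs(beta2) > t2]

        # Step 3: final OLS on the refined support
        beta_hat = np.zeros(p)
        beta_hat[S2] = np.linalg.lstsq(X[:, S2], y, rcond=None)[0]
        return beta_hat, S2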
Other comments:

- On line 069, the paper says "methods using non-convex penalties that can achieve model selection consistency under mild conditions often require huge computational expense". I do not understand what is meant by huge computational expense. This would be the case if it were necessary to find the global optimum of these non-convex problems, but it is not: as shown in Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348-1360, a local maximum essentially suffices and can be found efficiently.

- Line 118, "a conditional number constraint": do you mean "condition number constraint"?

- In Theorem 2, the assumption that d - s <= \tilde{c} is fairly strong. It essentially means that we can only overestimate the support by a constant amount.

- On line 421 there seems to be a typo: this should be tau > 2 gamma'.

- From Theorem 4, it seems that what you obtain is essentially a slow rate (i.e., in 1/sqrt{n}). It would be interesting to comment on the possibility of obtaining fast rates. The paper says "The final output will have the same output as if we knew the true model a priori with probability tending to 1"; however, we cannot set delta to 1 here, whereas if we knew the true model we would have delta = 1. So the statement seems a bit misleading to me.

- The paper says that the method is not iterative; however, at a high level LAT performs an OLS regression followed by thresholding of the coefficients twice, and then a final OLS regression, and the same holds for RAT. One could imagine increasing the number of iterations. Is it not possible to apply the analysis proposed in the paper to study such iterative schemes?

- In the experiments section, I am not sure I understand why RAT and the elastic net perform better in the presence of weak features. It would be great if the authors could comment on this in the experiments, since they seem to expect it.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes a new approach to high-dimensional sparse regression. The method consists of three steps. The first step applies ridge regression with a small penalty term and then thresholds the coefficients. The second step applies ordinary least squares to the sub-model returned in the first step, followed by another thresholding of the resulting coefficients. The third step applies ordinary least squares to the new sub-model returned in step 2. Theoretical results, such as variable selection consistency and L2 consistency, are established assuming the population covariance of the design matrix is well-conditioned.

Clarity - Justification:
Overall the paper is well-organized, but I found it unclear in two important respects. First, in the introduction the author(s) claim that their theoretical results are stronger than existing ones, but there is no detailed discussion or comparison after the statement of the main results. It is rather unclear in which way the results are stronger; indeed, the results are substantially weaker in some very basic settings (see the detailed comments below). Second, there are at least three tuning parameters involved in the algorithm: the ridge penalty in step 1, the threshold in step 1, and the threshold in step 2. It is not clear that BIC would produce the right choice of these tuning parameters.
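To make this concern concrete, a BIC-based choice of a single threshold could look like the sketch below (my own Python/NumPy sketch; the paper's exact criterion, and how it would handle the three tuning parameters jointly, may well differ):

    import numpy as np

    def bic_select_threshold(X, y, beta_init, thresholds):
        """Choose a hard threshold for a preliminary estimate via BIC.

        Only a sketch of how such a selection could work; whether this
        picks the right sub-model when three tuning parameters interact
        is exactly the question raised above.
        """
        n = X.shape[0]
        best_t, best_bic = None, np.inf
        for t in thresholds:
            S = np.flatnonzero(np.abs(beta_init) > t)
            if S.size == 0 or S.size >= n:
                continue  # skip empty or non-identifiable sub-models
            beta_S = np.linalg.lstsq(X[:, S], y, rcond=None)[0]
            rss = np.sum((y - X[:, S] @ beta_S) ** 2)
            bic = n * np.log(rss / n) + S.size * np.log(n)
            if bic < best_bic:
                best_t, best_bic = t, bic
        return best_t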
Significance - Justification:
The proposed method falls into a rather general and well-known class of estimators for high-dimensional regression: obtain a preliminary estimator and then refine it. Estimators under this umbrella include, but are not limited to, the adaptive lasso and folded concave penalized estimation. The conditions required by the theoretical development in this paper are also somewhat strong, and the empirical comparison is far from fair. See the detailed comments below.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
On the theoretical side, the assumptions are indeed rather stringent. In particular, the distribution of the covariate X needs to be Gaussian, and the covariance matrix needs to have a small condition number. Now consider this example: all variables in the covariate have unit variance and pairwise correlation 1/(2k), where k, much smaller than p, is the number of nonzero coefficients. The condition number is then lower bounded by p/(2k), which renders the main result useless (see the short numerical check after these comments). But this is a very easy scenario in which simple greedy algorithms such as orthogonal matching pursuit achieve exact variable selection adaptively with high probability, even without the Gaussianity assumption (see Cai and Wang, 2011, IEEE Trans. Inf. Theory, 57, p. 4680).

Second, it is not clear how this method would behave if the coefficient vector is not exactly sparse. This substantially limits the applicability and practicality of the method. The competing methods, such as the lasso or orthogonal matching pursuit (which was not compared with in the paper), are known to work well in both the sparse and the non-sparse case. Needless to say, the competing methods work both when n < p and when n > p, while the proposed method is only for p > n.

The numerical experiments are not fair. This is like using three-stage ridge regression to compare with one-stage lasso or SCAD. A fair comparison would be with the adaptive lasso (where one uses a preliminary lasso estimator in the first step and then adjusts the penalty according to the preliminary estimate), and with the folded concave penalty considered by Fan, Xue and Zou (2014, Ann. Stat., 42, p. 819).
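The short numerical check of the condition-number claim in the equicorrelation example above (my own snippet; the particular values of p and k are only illustrative):

    import numpy as np

    # Equicorrelated covariance from the example: unit variances,
    # pairwise correlation rho = 1/(2k), with k << p nonzero coefficients.
    p, k = 1000, 10
    rho = 1.0 / (2 * k)
    Sigma = (1 - rho) * np.eye(p) + rho * np.ones((p, p))

    # Eigenvalues are 1 + (p - 1) * rho (once) and 1 - rho (p - 1 times), so the
    # condition number is (1 + (p - 1) * rho) / (1 - rho), which is >= p / (2k).
    print(np.linalg.cond(Sigma))  # about 53.6 here, versus p / (2k) = 50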
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper studies ordinary least squares (OLS) in high dimensions, where the number of predictors is greater than the number of samples or observations. The key idea in this framework is the sparsity of the coefficients. It is known that l1 regularization and other penalties can help to find an accurate solution in high dimensions. This paper proposes a three-step method that does not require any kind of penalty or regularization. In the first step, OLS is applied to find a reasonable model. In the second step, this model is refined using thresholding, followed by applying OLS again to find the final solution. Theoretical results are provided on the consistency of this method and the choice of its parameters.

Clarity - Justification:
This paper is well-written. Given the large volume of work, it is easy to follow and understand the main concepts.

Significance - Justification:
I have a few concerns, explained in the comments section. The given results are of importance, but I would not classify them as substantial and novel.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The analysis is based on a design matrix X drawn from a multivariate Gaussian distribution. On the other hand, the key to the success of this method is that the matrix \Phi = X^T (X X^T)^{-1} X should be diagonally dominant. Given the nice Gaussian distribution, I agree that this result is valid. However, in many real-data situations the design matrix X is far from Gaussian; for example, when working with high-dimensional datasets, the entries of X can be sparse too. I am therefore interested in what the authors think about the accuracy of their assumption on \Phi (a short numerical check of this point is sketched below). Does this method work well for a fixed design matrix? I suggest extending the result to the fixed-design setting (instead of the random-design setting); this would definitely improve the value of this work.

Moreover, I have concerns about the accuracy of Theorem 1. This theorem relates directly to step 1 of Algorithm 1 and provides guarantees on its performance. The parameter \delta is not used in this step of the algorithm, yet the bound given in Theorem 1 on line 390 involves \delta. I do not see why \delta should be important in this step when it is only used in the second step of the algorithm. In addition, why is the failure probability in line 390 assumed to be small?

Another concern is about the cardinality of the support of the coefficient vector, denoted s. In the theoretical results, I do not see any explicit bound on how the parameter s may grow as a function of p and/or n. It is important to be able to quantify an interval for the parameter s. This is my concern in the experimental part too: typically, s is chosen to be much smaller than p, e.g., s = 5 and p = 10^4. I believe this is a very simplified case, and I would like to see how the different methods compare as a function of the cardinality s. The other benefit of such an experiment would be an understanding of the computational complexity of the method.

=====
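The numerical check referred to in Review #3 (my own Python/NumPy sketch; the design distributions, sizes, and the off-diagonal-versus-diagonal ratio used here are only an illustrative proxy for the assumption on \Phi, and no claim is made about the outcome):

    import numpy as np

    def offdiag_vs_diag(X):
        """Largest off-diagonal entry of Phi = X^T (X X^T)^{-1} X relative to
        its smallest diagonal entry (a crude proxy for the requirement that
        the off-diagonal entries be small compared to the diagonal)."""
        Phi = X.T @ np.linalg.solve(X @ X.T, X)
        off = np.abs(Phi - np.diag(np.diag(Phi))).max()
        return off / np.diag(Phi).min()

    rng = np.random.default_rng(0)
    n, p = 100, 1000

    designs = {
        "gaussian": rng.standard_normal((n, p)),
        "heavy-tailed (t, df=2)": rng.standard_t(df=2, size=(n, p)),
        "sparse (20% nonzero)": rng.standard_normal((n, p)) * (rng.random((n, p)) < 0.2),
    }
    for name, X in designs.items():
        print(name, offdiag_vs_diag(X))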