We thank all the reviewers for their comments. Please find our response below.

1. The theoretical improvement (heavy-tailed error)

As mentioned in the paper, LAT/RAT can handle heavy-tailed noise (only bounded variance is assumed). Most existing variable selection algorithms require light-tailed (sub-Gaussian) noise in order to handle the "log p = O(n)" regime. One exception is Lounici (2008), who proved that the lasso can also handle heavy-tailed noise under the irrepresentable condition and a bounded design. In the article, we show that LAT/RAT attain the same property when the design is Gaussian with a moderately growing condition number, and no further assumptions. The assumptions in Raskutti et al. (2009) are weaker, but so are their results: they only quantify the l2/l1 estimation error and are not tight on the l_inf error, so model selection is not (tightly) guaranteed in their paper. We note that the irrepresentable condition is essentially necessary and sufficient when convex penalties are used and variable selection is the goal.

2. Gaussian design

The design can easily be extended from Gaussian to any elliptical distribution. We believe the approximate orthogonality of X^T(XX^T)^{-1}X holds for even broader families of distributions. In the paper, we plotted X^T(XX^T)^{-1}X for the real data and a diagonal pattern is clearly visible (a small numerical sketch illustrating this check appears after point 4 below). In fact, we have investigated some discrete distributions and fixed designs, and both work well. Extending the current theory to these design matrices is an interesting future direction.

3. The condition number assumption

The only assumption we make on the design matrix is that the covariance matrix has a moderately growing condition number. This assumption is different from diagonal dominance; in fact, under this assumption we can prove that X^T(XX^T)^{-1}X is approximately diagonally dominant. The example provided by Reviewer 1 is a perfect case where the condition number assumption fails, favoring algorithms that rely on the irrepresentable condition. However, we note that the condition number assumption is just one sufficient condition that could potentially be relaxed. In fact, Experiment (ii) in the paper investigates a similar but harder case than Reviewer 1's example, and LAT/RAT perform well there. We are currently working on relaxing the condition number assumption; it may be replaceable by a lower bound on the eigenvalues of the covariance matrix. The reason we call the condition number assumption weak is that it allows high correlations among variables, which are common in real datasets and can easily violate the irrepresentable condition. For example, when O(s) zero variables are correlated with the nonzero variables at a correlation of 0.9, both the lasso and orthogonal matching pursuit fail, while the condition number assumption still holds. That said, no single algorithm handles all scenarios exceptionally well; every algorithm has its own advantageous regime, and practitioners choose the one that meets their needs.

4. Sparsity assumption

Since we focus on variable selection, the sparsity assumption is inevitable. However, if the non-sparse coefficient vector is separable, i.e., the true sparse signals are above some threshold and the remaining coefficients lie in an l_r ball, then the algorithm can still correctly pick out the large signals.
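The following is a minimal numerical sketch of the check referred to in point 2; it is not the code used in the paper, and the dimensions, the AR(1) covariance (chosen as a simple well-conditioned stand-in), and the random seed are purely illustrative. It computes X^T(XX^T)^{-1}X for a correlated Gaussian design and compares the diagonal entries with the off-diagonal ones.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, rho = 200, 2000, 0.3

    # AR(1) covariance Sigma_ij = rho^|i-j|: correlated columns, but the
    # condition number of Sigma stays bounded as p grows.
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

    # P = X^T (X X^T)^{-1} X, the matrix discussed above; its diagonal
    # entries average exactly n/p since P is a rank-n projection.
    P = X.T @ np.linalg.solve(X @ X.T, X)
    off_diag = P - np.diag(np.diag(P))
    print("mean diagonal entry:    %.4f" % np.diag(P).mean())
    print("largest |off-diagonal|: %.4f" % np.abs(off_diag).max())

Plotting P (e.g., as a heatmap) for the real-data design reproduces the diagonal pattern mentioned in point 2.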
5. Experiment setup

In the experiments, we modified the lasso to also have three stages (lasLAT and lasRAT). In the first stage, the lasso selects a group of variables (corresponding to Stage 1 of RAT). In Stages 2 and 3, exactly the same algorithm as in RAT is applied to the selected variables (a schematic sketch of this pipeline is given at the end of this response). We believe the comparison is fair, because no parameter adaptation is used in Stage 1 of RAT either. In addition, we have compared LAT/RAT with popular nonconcave penalties such as MCP, which should perform similarly to the adaptive lasso. Since readers are interested, we will add a comparison with the adaptive lasso in the next revision. Concerning the sparsity level, Experiment (iii) has 15 nonzero coefficients; we will add experiments with different sparsity levels in the next revision.

6. Slow convergence rate and the \delta

The slow convergence (a rate of n^{-1}) is due to the heavy-tailed noise. In the article, we only assume the noise to have a bounded variance, so under such a general assumption the rate can hardly go beyond \sqrt{n}. When the noise is assumed to be light-tailed, the rate can be improved to \exp(-n) for both Theorem 1 and Theorem 2. We apologize for the notational confusion around \delta: the \delta in Theorems 1 and 2 is different from the one used in Stage 2 of LAT/RAT, which comes from Corollary 1. We will use different notation in the next version.
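For concreteness, below is a hedged sketch of the three-stage lasso-based comparator described in point 5, assuming scikit-learn's Lasso for Stage 1. The routine rat_stages_2_and_3 is only a placeholder for RAT's Stages 2 and 3 (not reproduced here), and the fixed regularization parameter lam is illustrative rather than the value used in the experiments.

    import numpy as np
    from sklearn.linear_model import Lasso

    def las_rat(X, y, rat_stages_2_and_3, lam=0.1):
        """Three-stage comparator: lasso screening, then RAT's Stages 2-3."""
        # Stage 1: plain lasso screening with a fixed lambda, mirroring the
        # fact that no parameter adaptation is used in Stage 1 of RAT.
        stage1 = Lasso(alpha=lam).fit(X, y)
        selected = np.flatnonzero(stage1.coef_)
        # Stages 2 and 3: the same routine used by RAT, restricted to the
        # screened columns; coefficients outside the screen stay at zero.
        beta = np.zeros(X.shape[1])
        if selected.size > 0:
            beta[selected] = rat_stages_2_and_3(X[:, selected], y)
        return beta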