We thank the reviewers for their helpful comments and feedback. Here we respond to some specific points.
Reviewer 1:
* Novelty: we agree that the extension to the group sparse setting is straightforward (but, we believe, still very useful). However, the extension to multitask learning, where the errors within a single sample might not be independent across the multiple tasks, was very surprising to us, since the knockoff theory initially requires iid errors; we therefore believe this is quite novel.
* Defining \tilde{U}: thank you for catching this typo.
* Organization and emphasis: thank you for the feedback. In our next draft we will condense the connection from multitask learning to group lasso, and expand on the derivations with the knockoff conditions as suggested.
* Lasso path: yes, it is valid to simply take a grid of lambda values and proceed with this partial path (the statistics W still satisfy the sufficiency and group antisymmetry properties). This is what we do in practice, with a grid of 10,000 lambda values.
* Experiments: in general, if the group sparsity assumption is true, methods leveraging the group structure will be able to simultaneously attain lower FDR and higher power. In particular, selecting groups of signals together means that by catching one "easy to find" signal, we instantly make several true discoveries using group lasso; this isn't the case with lasso. In addition, the group knockoff matrix is constructed with more flexibility than the regular knockoff matrix, which can help boost power when within-group correlations are substantial.
The high power across all sparsity levels is due to the signal strength, which we chose so that the non-group knockoff also finds a reasonably good solution; however, we agree that more variability across the settings would be good, and we will adjust the settings to give a range of power in our next draft. In these figures, power is defined at the individual level (see lines 648-9).
Reviewer 2:
* Regarding group FDR vs regular FDR in simulations:
For the group sparsity experiments, we do report group FDR (see Figure 1). For the multitask learning experiments, group FDR in the group lasso reformulation is equivalent to regular FDR in the multitask learning formulation. That is, in multitask learning we aim to select the correct rows of the matrix B, where each row is one feature; if we stack B into a vector beta, the rows of B become groups of entries of beta, so the FDR would be named "group FDR" but is exactly the same quantity. We will clarify this in our next draft.
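As a small illustration of this equivalence, the sketch below stacks a toy coefficient matrix B into a vector beta and checks that the row-level and group-level false discovery proportions coincide. The sizes, the random data, the column-by-column stacking order, and the helper `fdp` are assumptions made only for this example; they are not taken from the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 6, 4, 3  # samples, features, tasks (illustrative sizes only)
X = rng.normal(size=(n, p))
B = np.zeros((p, r))
B[[0, 2], :] = rng.normal(size=(2, r))  # features 0 and 2 carry the signal

# Stack B column by column, so that beta[k * p + j] == B[j, k].
beta = B.flatten(order="F")
# One copy of X per task: vec(X B) = (I_r kron X) vec(B).
X_big = np.kron(np.eye(r), X)
stacked_ok = np.allclose(X_big @ beta, (X @ B).flatten(order="F"))

# Group j collects the r entries of beta that came from row j of B.
groups = [np.arange(j, p * r, p) for j in range(p)]

def fdp(selected, true_support):
    """False discovery proportion of a selected index set (0 if empty)."""
    selected = set(selected)
    if not selected:
        return 0.0
    return len(selected - set(true_support)) / len(selected)

true_rows = {j for j in range(p) if np.any(B[j] != 0)}
true_groups = {j for j, g in enumerate(groups) if np.any(beta[g] != 0)}

selected = [0, 1, 2]  # hypothetical output of some selection procedure
row_fdp = fdp(selected, true_rows)      # "regular" FDP over rows of B
group_fdp = fdp(selected, true_groups)  # "group" FDP over groups of beta
```

Because each group of beta is exactly one row of B, the two supports are the same set, and any selection yields the same proportion under either name.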
* Regarding different FDR levels across the different methods:
We agree that in principle it would be best to adjust each method to control FDR at the same level, then to compare power. However in practice, when we run a method at level q=20%, we do not know what the resulting FDR is; we would not be able to adjust q in order to get a "true" FDR of 20% since we cannot calculate this. Therefore we have decided to compare the methods as they would be run in practice, i.e. by controlling the input parameter q, since we cannot measure the true resulting FDR unless the ground truth is known.
* For controlling group FDR using the regular knockoff technique, we are not aware of a way to do this well because the regular knockoff technique is specifically designed to estimate the FDR (not the group FDR). In other words, modifying the knockoffs to estimate and control group FDR instead of FDR would result in the group knockoff method we proposed here.
However, we can give a coarse bound on group FDR in terms of regular FDR:
group FDR <= (maximum group size) * (regular FDR)
Controlling regular FDR at level q/(maximum group size) would therefore guarantee group FDR <= q, but this would likely be extremely conservative.
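The coarse bound can be checked numerically at the level of the false discovery proportion (FDP), from which the bound on FDR follows by taking expectations. The sketch below assumes that a group counts as selected when any of its features is selected, and as a false discovery when it contains no true signal; these definitions, the group sizes, and the random selections are assumptions made for illustration only.

```python
import random

def feature_and_group_fdp(group_of, is_signal, selected):
    """Return (feature FDP, group FDP) for a list of selected feature indices."""
    if not selected:
        return 0.0, 0.0
    false_features = [j for j in selected if not is_signal[j]]
    # A group is a true signal group if it contains at least one signal.
    signal_groups = {group_of[j] for j in range(len(group_of)) if is_signal[j]}
    # A group is selected if at least one of its features is selected.
    selected_groups = {group_of[j] for j in selected}
    false_groups = selected_groups - signal_groups
    return (len(false_features) / len(selected),
            len(false_groups) / len(selected_groups))

random.seed(0)
n_groups, group_size = 20, 5  # equal-sized groups, so max group size is 5
group_of = [j // group_size for j in range(n_groups * group_size)]

bound_holds = True
for _ in range(1000):
    is_signal = [random.random() < 0.1 for _ in group_of]
    selected = [j for j in range(len(group_of)) if random.random() < 0.2]
    feat_fdp, grp_fdp = feature_and_group_fdp(group_of, is_signal, selected)
    # group FDP <= (maximum group size) * (feature FDP), up to rounding
    bound_holds = bound_holds and grp_fdp <= group_size * feat_fdp + 1e-12
```

The bound holds because every falsely selected group contains at least one falsely selected feature, while the number of selected groups is at least the number of selected features divided by the maximum group size.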
* For multitask learning where the design matrices differ between tasks:
We were also very interested in this question, since the HIV data we used for our experiment has missing data; we might therefore want to use different subsets of the rows of X for different tasks, depending on where the response data is missing. Theoretically, however, we found that allowing different design matrices is incompatible with allowing for unknown and non-diagonal covariance of the noise. If the r-dimensional response y_i follows a linear model with iid noise e_i ~ N(0,sigma^2*identity), we can trivially rearrange this into a group lasso setup and apply the knockoff theory as stated; but if e_i ~ N(0,Sigma) for a non-diagonal Sigma (nonzero correlations are often the case in practice), then the theory does not work out. Specifically, some commutativity properties used in the argument starting at line 513 would not go through. In practice, however, we believe that the method would likely work well.
* Thank you for your suggestions for additional experiments and discussion regarding group size and correlations. These questions are interesting, and we will add simulations addressing them in our next draft if space permits.