Paper ID: 845
Title: The knockoff filter for FDR control in group-sparse and multitask regression

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper extends the knockoff filter to the setting where features are grouped, so that instead of selecting individual features we select groups of features. Multitask regression is also treated (instead of just the simple lasso). The knockoff filter is a recently introduced idea, and the extension to groups of features is timely and relevant. The developed theory is sound, and extensive experiments are presented. Incremental, but still nice work. I recommend acceptance.

Clarity - Justification:
The paper is very clear.

Significance - Justification:
The paper extends the recently introduced knockoff filter in order to discover sets of relevant/irrelevant features instead of single features.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
In general I like the paper and the proposed methodology. The proposed extension is timely and novel. I recommend acceptance.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors extend the idea of knockoffs to the group selection setting. It is an important problem, and the work generally seems solid and interesting. Some things could be better written/clarified, both in the exposition surrounding the method and in the experiments. Aside from this, my only criticism is that the work seems to follow very closely from Barber & Candes (2015), the original knockoff paper on variable selection (without groups), so it is unclear to me how difficult this extension was. That is not a very serious criticism (and should not be, in my opinion), but it does hurt the paper a bit on the novelty side.

Clarity - Justification:
Some things could be better clarified in the exposition and the experiments. See my detailed comments below.

Significance - Justification:
The work seems to follow very closely from Barber & Candes (2015), the original knockoff paper on variable selection (without groups), so it is unclear to me how difficult this extension was. That is not a very serious criticism (and should not be, in my opinion), but it does hurt the paper a bit on the novelty side.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
- The background details are appreciated, but I don't think you need to spend nearly an entire column (or more) describing how multitask learning fits into the group lasso setup. You can simply state this to be true and omit the details, or provide the details of the transformation in the supplement. To most readers, this transformation will be obvious.
- I don't understand the definition on page 3, line 306, of \tilde{U} as "a n × p orthonormal matrix orthogonal to the span of X". If X is n × p, then its span --- by which I assume you mean its column space, which I could write as col(X) --- is a p-dimensional subspace of R^n (assuming p < n here). So an orthonormal matrix whose columns are orthogonal to col(X) would have to be of dimension n × (n - p).
- It would be helpful to step through the algebra, even just very briefly in 1-2 lines, as to why your proposed group knockoff matrix \tilde{X} on page 3, line 304, satisfies the two desired properties in (11). Otherwise the reader will try to verify them in his or her head.
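For what it is worth, the verification is only a couple of lines, assuming the construction mirrors Barber & Candes (2015) with diag(s) replaced by a matrix S that is block-diagonal over the groups (the notation below is mine and may not match the paper's exactly):

\tilde{X} = X (I - \Sigma^{-1} S) + \tilde{U} C,  with  \Sigma = X^T X  and  C^T C = 2S - S \Sigma^{-1} S,

where \tilde{U} has orthonormal columns orthogonal to col(X), so that \tilde{U}^T X = 0 and \tilde{U}^T \tilde{U} = I. Then

\tilde{X}^T \tilde{X} = (I - S \Sigma^{-1}) \Sigma (I - \Sigma^{-1} S) + C^T C = \Sigma - 2S + S \Sigma^{-1} S + (2S - S \Sigma^{-1} S) = \Sigma,
X^T \tilde{X} = X^T X (I - \Sigma^{-1} S) = \Sigma - S,

so with S block-diagonal the off-diagonal group blocks of X^T \tilde{X} agree with those of \Sigma, which I take to be the content of (11). Spelling something like this out in the paper would spare the reader that mental algebra.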
- More generally, I think you could give a few more details below (11), and spend less time on the background details leading up to it. That is, define the group knockoffs and the group knockoff filter explicitly, with displayed equations for easy reference, since they are the main proposals in your paper!
- In Sections 3.2 and 3.3 you define things in terms of the group lasso path, i.e., the full continuum of solutions as \lambda varies. But this path is not easily computable, unlike the lasso path, which is exactly computable by LARS (I believe the exact solution path for the group lasso requires repeatedly solving an ODE; I think Hua Zhou has done work related to this topic, as have Luigi Augugliaro and coauthors). So in practice, would you simply compute solutions over a pre-defined (fixed) discrete grid of lambdas? And does this work theoretically? I would assume so, but it would be good to be explicit about this early on. (A sketch of the grid computation I have in mind is given at the end of these comments.)
- The experiments seem generally solid, but I'm confused about the power behavior in the figures. How can the group lasso knockoffs have both lower FDR and higher power? Is the group lasso performing that much better in terms of variable selection than the lasso here? And the fact that the group lasso has power of about 1 across all sparsity levels seems a bit odd. What is the precise definition of power --- is it a variable-wise definition or a group-wise definition?
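Regarding the grid of lambdas mentioned above: to be concrete, what I have in mind is something like the sketch below (plain proximal gradient for the group lasso, run down a fixed decreasing grid with warm starts, recording for each group the largest lambda at which it first becomes nonzero; all names are my own, and this is only an illustration, not the paper's implementation):

import numpy as np

def group_entry_lambdas(X, y, groups, lambdas, n_iter=500):
    """groups: list of disjoint index arrays covering 0..p-1; lambdas: decreasing grid."""
    n, p = X.shape
    beta = np.zeros(p)
    step = 1.0 / np.linalg.norm(X, 2) ** 2        # 1/L for the least-squares part
    entry = np.zeros(len(groups))                  # 0 = group never entered on the grid
    for lam in lambdas:                            # warm starts down the grid
        for _ in range(n_iter):                    # fixed number of iterations, no stopping rule
            z = beta - step * (X.T @ (X @ beta - y))
            for g in groups:                       # block soft-thresholding
                nrm = np.linalg.norm(z[g])
                if nrm > 0:
                    beta[g] = max(0.0, 1.0 - step * lam / nrm) * z[g]
                else:
                    beta[g] = 0.0
        for j, g in enumerate(groups):
            if entry[j] == 0 and np.linalg.norm(beta[g]) > 1e-8:
                entry[j] = lam                     # largest lambda at which group j is active
    return entry

Run on the augmented design [X, \tilde{X}], this would give the entry statistics for the original and knockoff groups over the grid rather than over the full path; my question is whether the FDR result is meant to cover (or easily extends to) this discretized version.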
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper generalizes the knockoff filter for the lasso to the case of the group lasso and the multitask group lasso. It presents a construction of knockoff variables for the group case, and shows that the group FDR is controlled for an appropriate statistic. The method is illustrated on both simulated and real datasets.

Clarity - Justification:
The paper is well written. Maybe some points could be discussed in more detail (see detailed comments).

Significance - Justification:
The paper considers a question which is quite natural and quite interesting. The paper is well written. The generalization to the group lasso does not seem to present serious technical difficulties (or the technical difficulties might arise in proposing refinements of the method that are not considered in the paper), but the investigation and the results presented are nonetheless worthwhile. The simulated data illustrate well the influence of several parameters on the performance of the different methods (correlations inside groups, number of tasks in the multitask setting, etc.). However, in the comparison made, the FDR that is controlled for the usual knockoff and for the pooled knockoff is naively the usual FDR for the detection of individual variables, and not the group FDR. The comparison is therefore not fully satisfactory, even if the presented results make it seem clear that, if the group-FDR were itself controlled at the right level, the results would be even more in favor of the group knockoff. It seems that if the group-FDR could be controlled in all methods, it would make for cleaner experiments. Alternatively, or in addition, the authors could discuss further the potential difficulties associated with controlling the group-FDR for the simple knockoff procedure, and how it would in some cases require more work. More generally, some of the discussion proposed in the paper could be further developed and refined: there are perhaps more things to say and analyze in the results.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
See my comments under Significance above.

Detailed comments:
In the experiments on the multitask knockoff, and as commented in the paper, the FDR of the parallel knockoff is controlled at level 0.2 for each separate regression, and as a consequence the FDR of the "parallel knockoff" procedure as a whole is not controlled at the desired level. The powers of the different methods are therefore not really comparable: the "parallel knockoff" is the one with the highest power, but at the wrong FDR. Would it be possible to control the FDR of each of the separate regressions so that the overall FDR of the whole procedure is controlled at the same level as for the pooled and multitask knockoffs?

For some reason that is not quite clear to me, the FDR of the parallel knockoff seems to vary with the sparsity level. In fact, I assume that the same knockoff variables are used for all the parallel knockoffs; is this what explains why the FDR is not controlled? Would it be possible, or would it make sense, to construct a different set of knockoffs for each of the parallel regressions?

In the leftmost plot of Figure 1, the FDR does not seem to be controlled at the same level by the different methods. Why is that? Is it because, for the knockoff, it is the FDR for individual detections that is controlled at level 0.2, and not the group-FDR? Would it be possible to set the FDR level for the usual knockoff so that the group-FDR is controlled at the desired level (this should be easy)? In particular, I assume this means that if the FDR were controlled at the right level, the power of the usual knockoff would be even lower. The same type of remark applies to the FDR of the knockoff in the cases where the correlation within the group is varied. I do not understand why the FDR of the usual knockoff is not correctly controlled at the right level for all values of the correlation. Would it be possible to correct this? It would seem quite relevant, at least, to discuss these issues in Section 5.1.2.
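To be concrete about the group-level quantities I would like to see reported for all of the methods, what I have in mind is simply the following (my own notation; for the non-group methods I would count a group as selected as soon as any of its variables is selected):

# selected, nonnull: sets of group indices (selected groups / truly non-null groups)
def group_fdp_and_power(selected, nonnull):
    selected, nonnull = set(selected), set(nonnull)
    fdp = len(selected - nonnull) / max(len(selected), 1)    # group-level false discovery proportion
    power = len(selected & nonnull) / max(len(nonnull), 1)   # group-level power
    return fdp, power

Reporting these group-level quantities for the plain and pooled knockoffs as well, with their target levels adjusted so that the group-FDR (rather than the variable-wise FDR) is controlled at 0.2, would in my opinion make the comparison cleaner.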
In the multitask setting, the case where the design matrix for each task is different has been considered in the literature. How does the method generalize to that case?

A few elements are not discussed in the paper which would be interesting to consider and investigate. In particular:
- Intuitively, the larger the group, the lower the type I error associated with the group should be.
- Similarly, the smaller the correlation between the elements of the group, the lower the type I error associated with the group should be. The experiments show that a corresponding property (for the correlation) holds for the power (which decreases as the correlation increases).

A question that would seem of interest is whether it is possible to take the group sizes or the different levels of correlation into account in order to improve the procedure. Clearly, the effects of size and correlation are already exploited by the procedure, because they make it possible for the knockoffs to be less correlated with the original group of variables, but the approach does not seem to take into account that it is potentially easier or harder for some groups to be false positives.

=====