Paper ID: 1308
Title: Power of Ordered Hypothesis Testing

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes an extension of a multiple testing procedure due to Barber and Candes (2015) for false discovery rate control in the framework of ordered hypothesis testing. The underlying philosophy is that if the ordering of the hypotheses (for example, obtained by a preliminary independent procedure or other prior knowledge) contains some information, i.e. if the false hypotheses tend to be more concentrated at the beginning of the list, then one can take advantage of that fact by restricting one's attention to the first k hypotheses, estimating the FDR on that set, and then choosing k adaptively so that the estimated FDR remains below a prescribed level (see the illustrative sketch after Review #2). In this paper, an extension of the method of Barber and Candes is proposed wherein an additional parameter is introduced (with appropriate motivation). The control of the FDR is established under the assumption of p-value independence, and an asymptotic study of power (as the number of hypotheses grows large) is also provided. Experiments on synthetic and real data (with a 'naive' choice of parameters) show that the proposed method is competitive against alternatives.

Clarity - Justification:
The paper is overall clear but feels rushed at times; in particular, Section 4 gets a little chaotic and should be polished.

Significance - Justification:
While the paper is an extension of a method proposed by Barber and Candes (2015), it seems to introduce interesting improvements in practice. Theory-wise, apparently the proof of FDR control for SS established by Barber and Candes used very similar methods; the power study, on the other hand, seems to be mainly based on the work of Li and Barber (2015). In this sense the theoretical advances are somewhat incremental.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Summary: overall an interesting contribution, showing an improvement over previous related methods; possible point of criticism: it seems somewhat incremental in comparison to Barber and Candes (2015) and Li and Barber (2015). Section 4 should get a few more rounds of polish.

* I find the terminology "sequential testing" used at times in the text (and in the running title!) not really appropriate, since it suggests a stopping time with respect to the observation of the first k p-values only, which is not the case.

* There is a slightly confusing terminology concerning Barber and Candes' SeqStep method (which apparently is an AT method) and Selective SeqStep (SS) (which apparently is not an AT method). This terminology should be clarified. In fact I was confused by the acronym; maybe SSS could be used as an abbreviation to distinguish it clearly from (non-selective?) SeqStep.

* Can the authors better situate the technical contribution? For example, apparently the proof of FDR control for SS established by Barber and Candes used very similar methods; the power study, on the other hand, seems to be mainly based on the work of Li and Barber (2015).

* The text becomes chaotic at times in Section 4 and should be proofread, polished, and possibly clarified. Some sentences are not finished. On l.542, is gamma equal to a? In Figures 2-3, it is somewhat confusing at first that the "strength" of the signal, as loosely discussed in the text, is in fact smaller when beta grows.
* In the experiments in Section 6, from the short description it seems that the p-values \tilde{p} and p are not independent of each other, which would mean that the ordering is not independent of the p-values. This seems to contradict the theoretical setting; could this be clarified?

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper describes a new approach to ordered testing, adaptive SeqStep, that is shown to perform well in settings where either the ordering is poor or the signal strength of the non-null hypotheses is weak.

Clarity - Justification:
While the ideas in the paper were generally clear, the writing was not. The paper needs to be proofread for grammar and clarity, and it has many typos. The ordering was inconsistent throughout. I would recommend introducing the new method earlier in the paper.

Significance - Justification:
I think that this method is an interesting contribution to the field, and the validation showed the niche it fills in the literature for poor orderings and weak signals.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
For someone coming from outside of this literature, the abstract (which was generally well written) and the introduction (which was less well written) were not all that clear. "Pre-specified ordering of the null hypotheses, from most to least promising": what does it mean for a null hypothesis to be most promising? What does a "weaker" ordering refer to? How are multiple hypothesis tests ordered, generally? I understand the genomics application with multiple dosages, but the motivation in the Introduction was not intuitive, nor did it bring up ideas for my own application of interest. The distinction between ordered testing and online testing was only clear after reading about online testing, which is never explicitly defined (i.e., that online is infinite). Ordered testing is strangely defined -- if we chose to access the later hypotheses earlier, or in batch, would that be possible? \pi_0 needs to be defined, as do $k$ and $s$ (carefully, since they are referred to a lot later). "By moving s towards 0": is this equivalent to moving k towards n? It is not clear what "scan far into the list" really means here.

One technical issue with the manuscript: the Zero Assumption (i.e., that null hypotheses will be more likely to have smaller estimates of effect size or p-values, respectively; Efron 2008) is made both in Storey's q-value method and in the AS approach. How might you remove this assumption?

Footnote 1: is 12 an equation or a citation? Proposition 1 has too many typos and omitted clauses to understand what it is saying. What does "powerless" mean? What does "in which the highest chance for the hypotheses to be non-null." mean? The whole of Section 4.1 is unclear and poorly written. There are a number of parameters that are set here. It is not clear why they are set that way (besides \lambda, although even that is not justified well), nor is it clear how sensitive the results are to those settings.

In the simulations, "which are all common cases in practice": in practice in what field? It is not true in my area of interest. The simulations are not particularly convincing. I liked the data example a lot, though. A description of how the ordering was obtained would be useful, and it is not clear how the BH and Storey methods were run, but the ideas were clear there.
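Illustrative sketch: Reviews #1 and #2 both refer to a rejection threshold s, a second parameter \lambda, and a cutoff k chosen adaptively so that the estimated FDR over the first k ordered hypotheses stays below the target level. The code below is a minimal sketch of such a rule, assuming a Storey-style form of the prefix-wise FDP estimate; the function name, the default parameter values, and the exact estimator are illustrative reconstructions, not the paper's definitions.

    import numpy as np

    def adaptive_ordered_rejections(pvals, alpha=0.1, s=0.1, lam=0.5):
        # pvals are assumed to be listed in the pre-specified order,
        # from most to least promising hypothesis.
        pvals = np.asarray(pvals)
        below = np.cumsum(pvals <= s)    # candidate rejections among the first k
        above = np.cumsum(pvals > lam)   # "null-like" p-values among the first k
        # Storey-style FDP estimate for each prefix length k = 1, ..., n
        fdp_hat = (1.0 + above) / np.maximum(below, 1) * s / (1.0 - lam)
        ok = np.where(fdp_hat <= alpha)[0]
        if len(ok) == 0:
            return np.array([], dtype=int)
        k_hat = ok[-1] + 1               # largest prefix with estimated FDR <= alpha
        return np.where(pvals[:k_hat] <= s)[0]   # reject ordered hypotheses with p_j <= s

With lam equal to s this reduces to a one-parameter, selective SeqStep-style rule; taking lam larger than s corresponds to the kind of additional parameter Review #1 mentions.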
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper looks at the ordered hypothesis testing problem, which differs from the usual multiple hypothesis testing setup in that the hypotheses are assumed to be ordered a priori in an informative way. In this setup one can only select the number, k, of hypotheses one rejects, which implies rejecting H_1, ..., H_k. This problem has only recently emerged in the statistical literature, with two papers published in very prestigious journals (JRSS B and the Annals of Statistics) and another currently available on arXiv. This paper offers a useful variation on the Annals-published method of Barber and Candes for controlling the FDR, called "selective SeqStep". The variation itself is fairly obvious in hindsight, but then again hindsight is never the same as foresight, so the authors deserve all the credit for making that observation, which seems to offer a more powerful procedure for controlling the FDR. The latter point is argued both theoretically and empirically. Specifically, the authors demonstrate that, as expected, their method always improves on the original selective SeqStep method, and it also improves on the "ForwardStop" accumulation test of Li and Barber (arXiv) unless the signal (i.e. the set of alternative hypotheses) is both dense and strong. Since the latter is not the typical case, the new method should be preferred in a typical ordered hypothesis testing setup. In proving that their variation, called "adaptive SeqStep", controls the FDR, the authors borrow heavily from the proof of Barber and Candes; still, the adaptation of the proof to the new procedure is not trivial and is of some interest in its own right. (A side-by-side sketch of the two FDP estimates appears at the end of these reviews.)

Clarity - Justification:
It is difficult to assign a single answer to this question, because this paper has a split personality. The introduction and the section that describes the new adaptive SeqStep procedure are reasonably well, though far from perfectly, written. However, as the authors move on to the power calculations, Mr. Hyde kicks in and presents us with poorly written text that is next to impossible to follow. Take Proposition 1, for example: it changes the definition of \pi(t) that was given just above, and it has a sentence ending in "and." Two lines below (367-8) an unexplained "j" pops up in the equation, as does an unexplained "f" in the line below. The corresponding proofs in the appendix are also rife with errors.

Significance - Justification:
I would actually rate it between "Excellent" and "Above Average", for the following reason. In terms of the potential utility of the novel adaptive SeqStep it is undoubtedly "Excellent", since it seems like it should be the preferred default method for analyzing ordered hypotheses. However, in terms of the scientific distance from previous work I would place it at "Above Average".

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Perhaps it was rushed, but this paper feels like it could have been significantly better. There are so many errors in the paper that I found myself wondering why I should spend my time carefully reading it when the authors clearly didn't spend the necessary time to create a decent presentation of their work. At some point I gave up on trying to follow the details of Sections 3 and 4.1.
Even the earlier and better-written sections have their share of errors, from simple spelling ones (e.g., "unifrom") to mathematical ones, e.g., "n" instead of "k" in the denominator on lines 229-230 and in the Appendix proof of Theorem 1, and an unexplained "S" (presumably it should be H_0) in the same proof, to list just a few.

At the same time, approaching this work from a user's perspective, I feel that too little thought has been given to the critical question of practically setting the values of the parameters "s" and "lambda".

That said, I think the new adaptive SeqStep is important enough to overlook the numerous faults of this paper. I would give it an overall rating of "accept", but since I don't have that option I would rather err on the side of "Strong accept".

=====
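Side-by-side sketch of the two FDP estimates: Review #3 describes adaptive SeqStep as a variation on Barber and Candes' selective SeqStep, and Review #1 notes that the variation adds one parameter. As a reading aid only, and as a reconstruction in the spirit of the reviews rather than a quotation of the paper's displays, the two prefix-wise estimates can be written as

\[
\widehat{\mathrm{FDP}}_{\mathrm{SS}}(k)
  = \frac{s}{1-s}\,
    \frac{1+\#\{j\le k:\ p_j > s\}}{\#\{j\le k:\ p_j \le s\}\vee 1},
\qquad
\widehat{\mathrm{FDP}}_{\mathrm{AS}}(k)
  = \frac{s}{1-\lambda}\,
    \frac{1+\#\{j\le k:\ p_j > \lambda\}}{\#\{j\le k:\ p_j \le s\}\vee 1},
\qquad s\le\lambda,
\]

with \(\hat{k} = \max\{k : \widehat{\mathrm{FDP}}(k) \le \alpha\}\) and rejection set \(\{j \le \hat{k} : p_j \le s\}\). Setting \(\lambda = s\) recovers the one-parameter SS rule, which is why Review #3's question about how to set s and \(\lambda\) in practice is the natural user-facing concern.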