We thank the reviewers for their thoughtful and encouraging comments on the significance of our work. We also thank them for pointing out numerous presentation issues, ranging from typos to poor organization, which we will work hard to correct in the new draft. We begin by addressing questions raised by multiple reviewers, then answer reviewer-specific questions.

1) Sec. 4: Here we give a detailed quantitative power comparison. We see now that this section is far too dry, with too little intuition. We will move some derivations to the appendix and streamline the section, reorganizing it to emphasize two main points. First, AT is more fragile than AS: when Pi(0) is small (i.e., most of the early p-values are null), AT is powerless no matter how we choose lambda, whereas AS always has nonzero power if we choose s small enough. Second, AS achieves better power than SS by estimating the null proportion.

2) Choosing s and lambda: This is an important point, which we will expand on in the next draft. First, as discussed in Sec. 5, setting lambda = 0.5 is a good default if most p-values larger than 0.5 are null. As for s, in the gene example we use the heuristic s = q (we will also add this choice to our simulations). To explain this choice, note that s = q gives nonzero power if F1(q)/(1-q) > (1 - Pi(0))/Pi(0), a direct consequence of equations (8)-(9), which we will derive in the new draft. If, say, q = 0.1 and F1(q) = 0.5, then s = q is small enough as long as Pi(0) > 0.64. That is, if the non-nulls have reasonably strong signal and most of the early p-values are non-null, then s = q is small enough. Note that s > q is typically undesirable: it would mean using a more liberal rejection cutoff than unadjusted testing. If we expect few non-nulls early in the list, or a weak signal among the non-nulls, we can apply similar logic to assess smaller values of s, but there is no universal value that works for all problems.
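The Pi(0) > 0.64 figure above follows by solving the power condition for Pi(0). A minimal sketch of that arithmetic (the helper name `smallest_pi0` is ours, not from the paper):

```python
# The condition from the text: s = q gives nonzero power when
#   F1(q) / (1 - q) > (1 - Pi0) / Pi0,
# where F1 is the non-null p-value CDF and Pi0 the non-null
# proportion early in the list.

def smallest_pi0(q, F1_at_q):
    """Smallest Pi(0) for which s = q gives nonzero power.

    Rearranging F1(q)/(1-q) > (1 - Pi0)/Pi0 gives
    Pi0 > (1 - q) / (1 - q + F1(q)).
    """
    return (1 - q) / (1 - q + F1_at_q)

# The worked example from the text: q = 0.1 and F1(q) = 0.5.
threshold = smallest_pi0(q=0.1, F1_at_q=0.5)
print(round(threshold, 2))  # prints 0.64
```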
Note that the other SeqStep methods also require a choice of s. An interesting topic for further research is whether a good value of s can be estimated from the data.

3) Prop. 1: We apologize for the stray “and,” and the confusion it created. With “and” deleted, Prop. 1 is correct as stated. We will also improve the notation in the next draft.

Reviewer 1: 1) “Most promising hypotheses,” “weak ordering”: We will clarify these phrases and give examples. “Most promising” means hypotheses with the highest a priori chance of rejection. For example, when testing for association with disease A, evidence from another experiment could suggest that certain genes play a major role in a related disease B. We might then put those genes early in the list. A “strong” ordering is one that concentrates the non-null hypotheses at the beginning of the list (ideally, with all non-nulls coming first, then all nulls); a “weak” ordering is the opposite.

2) Zero Assumption: A very interesting question. All methods in our paper enjoy finite-sample FDR control without this assumption, but if it holds, the power of AS will be somewhat better. We will make this point.

3) Values of n “common in practice”: This was poorly worded, since many genomic applications (including our real-data example) have n >> 1000. We simulated with smaller n mainly because we were investigating finite-sample performance; large values of n should be covered by the asymptotic theory. We will add the case n = 10,000 to illustrate performance for large n.

4) Other clarifications: “(12)” in Footnote 1 refers to equation (12) in Li and Barber; and BH and Storey’s procedure were run on the dose-response data as usual (i.e., without regard to the ordering). We will clarify these points, as well as the definition of online testing, in the new draft.
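The strong/weak ordering terminology in our reply to Reviewer 1's first point can be made concrete with a toy construction of our own (not data from the paper): a strong ordering puts a high fraction of non-nulls in every early prefix of the list, a weak one does not.

```python
# Labels are 1 for non-null, 0 for null, listed in testing order.
strong = [1] * 5 + [0] * 15   # idealized strong ordering: all non-nulls first
weak = [0] * 15 + [1] * 5     # idealized weak ordering: all non-nulls last

def prefix_nonnull_fraction(labels, k):
    """Fraction of non-null hypotheses among the first k in the list."""
    return sum(labels[:k]) / k

# Under the strong ordering, early prefixes are all non-null;
# under the weak ordering, they are all null.
print(prefix_nonnull_fraction(strong, 5), prefix_nonnull_fraction(weak, 5))  # prints 1.0 0.0
```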
Reviewer 2: 1) “Sequential testing”: We will replace all occurrences with the clearer phrase “ordered testing.”

2) Abbreviating “selective SeqStep” as SSS: We worry that the analogous abbreviation for “adaptive SeqStep” could threaten our G rating!

3) Technical contribution: We agree that we borrow heavily from Barber and Candes (2015) and Li and Barber (2015). However, when proving exact FDR control, the martingale structure is more complicated, requiring an extension of Barber and Candes' proof. For the asymptotic power analysis, we adopt the framework of Li and Barber (2015), but the analysis is different, although at a high level both proofs rely on the concentration properties of the target quantities.

4) Real data example, independence of p-values and ordering: Good question! We neglected to mention that the low-dose vs. control p-values come from permutation tests. The ordering is based on t-statistics comparing high-dose to *pooled* low-dose and control data; by construction, these t-statistics are invariant to permutations between the low-dose and control data. Therefore, the null p-values are independent and (discrete) uniform conditional on the ordering, as required by the theory.
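The invariance argument in point 4 can be checked on simulated data. A toy sketch (simulated observations for a single gene and a helper `tstat` of our own, not the paper's dose-response data): the permutation p-value is computed by shuffling low-dose/control labels, while the ordering statistic depends on those samples only through their pooled union, which any such shuffle leaves unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy data for one gene: 8 high-dose, 8 low-dose, 8 control samples.
high = rng.normal(1.0, 1.0, 8)
low = rng.normal(0.0, 1.0, 8)
control = rng.normal(0.0, 1.0, 8)

def tstat(a, b):
    """Two-sample pooled-variance t-statistic."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(sp2 * (1 / na + 1 / nb))

# Ordering statistic: high-dose vs. the *pooled* low-dose and control samples.
pooled = np.concatenate([low, control])
order_stat = tstat(high, pooled)

# Permutation p-value for low-dose vs. control: shuffle the low/control labels.
obs = tstat(low, control)
perm_stats = [
    tstat(perm[: len(low)], perm[len(low):])
    for perm in (rng.permutation(pooled) for _ in range(999))
]
pval = (1 + sum(abs(t) >= abs(obs) for t in perm_stats)) / 1000

# Key point: relabeling low-dose vs. control samples leaves `pooled` (as a set),
# and hence the ordering statistic, unchanged.
assert np.isclose(tstat(high, rng.permutation(pooled)), order_stat)
```

This is only an illustration of the invariance, under the (assumed) setup of one t-statistic per gene; the actual test statistics and sample sizes are those of the dose-response data in the paper.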