We thank the reviewers for their positive feedback. The reviewers made good suggestions, and we will correct the paper accordingly. A few points need clarification at this stage:

- The range of the loss: Indeed, we assume that the range of the loss is $[0,1]$. By mistake, this was not stated explicitly in Theorem 1. Note that the range of the surrogate loss, $\ell^s$, does not have to be $[0,1]$, so we can use the hinge loss and similar convex surrogates. Note also that Subsections 2.1 and 2.2 explicitly deal with this assumption and explain how to ensure it holds.

- The sentence ``realizable w.r.t. some surrogate loss'': This means that the first term on the right-hand side of Equation (4) is zero. We will state this explicitly.

- The value of $c$ vs. $\epsilon$ in Theorem 2: The goal of Theorem 2 is to show that in some situations, a constant number of examples from $D_2$ suffices to guarantee that the loss on $D_2$ is small. Naturally, this will not always be the case, but we give data-dependent conditions under which a constant number of examples from $D_2$ is sufficient. As motivation, consider Figure 1, and suppose $H_{1,\epsilon}$ contains the conjunctions over all subsets of the features ``has eyes'', ``has nose'', ``has mouth'', ``has skin color''. Let $h^*$ be the conjunction of all 4 features. It is reasonable to assume that the examples in $D_2$ lack one of these features. Assume also, for simplicity, that each lacking feature accounts for at least $1/8$ of the mass of $D_2$. Hence, the error on $D_2$ of every ``wrong'' function in $H_{1,\epsilon}$ is at least $1/8$, while the error of $h^*$ is $0$. In this simple example, $c = 1/8$ (see the short sketch at the end of this response). (To Reviewer 3: the confusion arises because $c$ is defined as the error w.r.t. $D_2$, while $H_{1,\epsilon}$ is defined w.r.t. $D_1$.)

- Height of the tree: Indeed, this was a typo. Thanks for pointing it out, and sorry for the confusion.

- Comparison to Online Boosting (Beygelzimer, Kale, Luo, ICML'15): The setting is completely different. In BKL, a weak online learner, whose error rate is $1/2-\gamma$, is given as a black box, and the goal is to obtain a strong online learner whose error rate is $\epsilon$. Here, a strong online learner is given as a black box, and the goal is to solve a batch optimization problem fast.

- Reviewer 3: We will use your suggestions to improve the clarity of the ``How'' section. However, regarding your claim that the strategy for the ``p'' player was never given formally, please note that we provide detailed pseudo-code (Section 2.3) that gives all the details needed to implement the strategy efficiently (the Tree pseudo-code and the call to this procedure from the FOL pseudo-code).
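
To make the $c = 1/8$ arithmetic explicit, here is a minimal sketch, under our reading that ``each lacking feature takes at least $1/8$ of the mass of $D_2$'' refers to the examples missing exactly that feature: any $h \in H_{1,\epsilon}$ other than $h^*$ omits at least one feature $f$, and therefore accepts the examples that lack only $f$, whereas $h^*$ rejects them, so
$$
\mathrm{err}_{D_2}(h) \;\ge\; \Pr_{x \sim D_2}\bigl[x \text{ lacks exactly the feature } f\bigr] \;\ge\; \tfrac{1}{8},
\qquad
\mathrm{err}_{D_2}(h^*) \;=\; 0,
$$
which is where the constant $c = 1/8$ in the example comes from.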