R1: Very good question on what the 'right' teaching dimension (TD) extension is. Yes, there are several ways to extend TD, and we by no means imply that our learner-specific TD is the only valid extension. We chose to study this particular extension because it is (i) mathematically interesting and (ii) already useful in a number of applications. For example, we are collaborating with psychologists who propose cognitive models (i.e., learning algorithms) of human students. To the extent that those cognitive models are accurate, our TD characterizes the minimum 'lesson size' for a student to learn a target concept, and we are conducting human experiments to validate the optimal teaching sets associated with this TD. As another example, in the aforementioned data poisoning attacks it is possible to characterize the minimum poisoning effort in terms of the number of original training items that must be changed (this is slightly more complicated because the changes are with respect to a given original training set). We think of our learner-specific TD as 'ask more (information about the learner) and get more (stronger guarantees)'. It is perhaps conceptually less 'elegant' than a learner-independent TD because it requires this extra knowledge of the learner, but it is reasonable in certain situations. One interesting middle ground we are currently pursuing is the setting where the teacher knows the learner only up to a family (e.g., an SVM with unknown regularization weight lambda).

And of course, the reviewer's suggestion of 'TD at radius r' is another perfectly valid extension of TD. Internally, our group has been calling this extension epsilon-approximate TD (replacing r with epsilon). Interestingly, there has also been prior work on this extension, albeit from a different (but equivalent) angle: "Complexity of Teaching by a Restricted Number of Examples" (COLT'09) by Hayato Kobayashi and Ayumi Shinohara, who taught with a training set SMALLER than the classic TD and studied the corresponding value of epsilon.
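To make the epsilon-approximate notion concrete, here is a minimal brute-force sketch; the ridge-regression learner, the finite pool, and the Euclidean parameter-error measure are illustrative assumptions on our part, not the exact setup of the paper or of the COLT'09 work:

```python
import itertools
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge learner: theta_hat = (X'X + n*lam*I)^{-1} X'y (one common convention)."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

def eps_at_size(pool_X, pool_y, theta_star, lam, k):
    """Smallest achievable parameter error using a teaching set of size k
    drawn from the pool (exhaustive search; illustrative, not scalable)."""
    best = np.inf
    for idx in itertools.combinations(range(len(pool_X)), k):
        theta_hat = ridge_fit(pool_X[list(idx)], pool_y[list(idx)], lam)
        best = min(best, np.linalg.norm(theta_hat - theta_star))
    return best
```

The epsilon-approximate TD at a given epsilon is then the smallest k for which `eps_at_size(...)` is at most epsilon; in the spirit of Kobayashi and Shinohara, teaching sets smaller than the classic TD trade size for a nonzero epsilon.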
Furthermore, such an extension is absolutely necessary in the so-called pool-based teaching setting, where the teacher cannot construct arbitrary training items from the whole input space but must instead choose from a (finite) given pool of candidate training items. This is because one cannot hope for perfect teaching using a pool. We are in fact looking into epsilon-approximate TD in ongoing work. In summary, we agree with Reviewer 1 that there are multiple ways to extend classic TD. This paper focuses on one extension that we feel is meaningful; many others remain open questions for future study, and we hope our paper generates a dialogue in the community.

R2: We thank the reviewer for the comments. We are not aware of connections between TD and consistency / standard risk bounds beyond the (loose) relation between TD and the VC dimension in [Goldman and Kearns'95]. This is an interesting open question for future research. We will address the other minor comments in the revision.

R3: The reviewer's sharp eye noticed that our teaching items (x, y) may not always obey the target concept. For example, in Proposition 1 our y1 is slightly off the target value x1 theta* = ||x1||^2/a. Mathematically, this is a direct consequence of the learner coming into learning with a built-in prior (in this case the ridge term), which biases its estimate toward the zero parameter. The optimal teacher knows this and designs 'overshooting' training data to exactly compensate for it. Overshooting teaching strategies have appeared before; see e.g. 'Machine Teaching for Bayesian Learners in the Exponential Family' (NIPS'13) by Zhu. More broadly, this opens up an ETHICAL question: when is it appropriate to teach with items that deviate from the 'truth'? For example, is it ethical to fake the outcome label of a patient (in classification) in the name of better training, even if that (patient feature, outcome) pair never appears in practice?
This ethical question is particularly relevant for applications of optimal teaching to human education, as discussed with Reviewer 1 above, and is part of our ongoing research. Eq. (1) actually covers both the homogeneous case and the inhomogeneous case: note that A in the regularization term of Eq. (1) is a PSD matrix. If A is the identity matrix I, Eq. (1) corresponds to the homogeneous case, while A = [I, 0; 0, 0] corresponds to the inhomogeneous case (the offset coordinate is left unregularized). We will make this clearer in the revision. Thanks for pointing out Zilles et al. 2011 and sample compression; we will adjust our statement and discuss the relationship between our work and sample compression in the revision.
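As a concrete illustration of the 'overshooting' phenomenon from our response to R3, here is a minimal 1-D ridge sketch; the specific numbers and the single-item setting are our own illustrative choices, not the exact construction of Proposition 1:

```python
# One-item ridge learner in 1-D: theta_hat = argmin_t (x1*t - y1)^2 + lam*t^2,
# which has the closed form theta_hat = x1*y1 / (x1^2 + lam).
# (More generally, with the PSD regularizer A of Eq. (1),
# theta_hat = (X'X + lam*A)^{-1} X'y; A = I is the homogeneous case,
# while A = [I, 0; 0, 0] leaves the offset coordinate unregularized.)
def ridge_1d(x1, y1, lam):
    return x1 * y1 / (x1**2 + lam)

theta_star, x1, lam = 2.0, 1.0, 0.5

naive = ridge_1d(x1, x1 * theta_star, lam)   # truthful label: shrunk below theta_star
y_over = theta_star * (x1**2 + lam) / x1     # overshooting label (> x1*theta_star)
exact = ridge_1d(x1, y_over, lam)            # recovers theta_star exactly
```

The truthful label x1*theta* yields an estimate shrunk toward zero by the ridge prior, while the overshooting label compensates for that shrinkage exactly, which is precisely why the optimal y1 deviates from the target concept.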