Paper ID: 58
Title: The Teaching Dimension of Linear Learners

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

This article studies the problem of constructing an artificial data set that, when given as input to various regularized optimization-based learning rules, produces exactly a specified solution to the optimization. (They also study a variant where only the classifier resulting from thresholding needs to be the same as the one specified.) They present this as a form of teaching, specific to a given learning rule. They then aim to quantify the minimum possible size of such a data set sufficient for this task, a quantity they refer to as the teaching dimension of that solution (parameter vector) for the given learning method. They provide fairly general lower bounds, along with upper bounds for several methods (ridge regression, SVM, logistic regression, in both homogeneous and inhomogeneous variants). The bounds are tight (though in two cases with a tiny gap due to rounding to an integer), so they have indeed identified the smallest number of carefully constructed samples sufficient to induce any desired optimal solution for these optimization problems.

Clarity - Justification:

The paper is very well written.

Significance - Justification:

It initiates a fascinating new line of study, albeit one in the very early stages. It is not entirely clear what the actual applications of this research might be, but the quantities involved seem to have some potential importance.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

Let me begin by saying I found this a very interesting paper, and this seems to be a fascinating new line of study, albeit one in the very early stages. I would not be surprised if future work reveals that studying the algorithm-specific teaching dimension of certain algorithms illuminates other aspects of those algorithms' behavior, or that the algorithm-specific teaching dimension has connections to other branches of the literature (much as quantities related to the original teaching dimension have popped up in other parts of the literature, such as sample compression, communication complexity of distributed learning, PAC sample complexity analysis, active learning, and the topologies of one-inclusion graphs). For this reason, my overall opinion of the work is positive, and my recommendation is acceptance.

That said, I anticipate the choice of terminology in this work may be somewhat controversial in some circles, with some wondering whether what this work studies is really best described as "teaching", or whether it is more closely related to something like sample compression schemes. For instance, suppose a follow-up work were to study the question of which learning algorithm has the smallest teaching dimension; in this framework, that would essentially degenerate into a coding theory problem, something the prior works on teaching complexity have taken great care to avoid. The present work studies familiar learning algorithms, so such coding tricks aren't really an issue. However, even here the examples used by the teacher don't really "describe" the target function in any traditional sense. In fact, in some cases (such as the ridge regression example), the examples (x, y) constructed by the teacher aren't even labeled correctly! (i.e., y is not equal to the real-valued target function's value at x).
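To make this last point concrete, here is a small numerical sketch (my own check as reviewer, written against the plain single-example ridge objective $(w^\top x - y)^2 + \lambda \|w\|^2$; the paper's exact construction and normalization may differ, but the phenomenon is the same): a single teaching example whose label deliberately disagrees with the target function can still force ridge regression to return exactly the target parameter vector.

import numpy as np

# Reviewer's sketch (not the paper's exact construction): for single-example
# ridge regression  min_w (w.x - y)^2 + lam * ||w||^2,  the teaching set
# {(x, y)} with x = w_star and y = ||w_star||^2 + lam recovers w_star exactly,
# even though y differs from the target's value w_star . x.
lam = 0.1
w_star = np.array([2.0, -1.0, 0.5])    # desired target parameter vector

x = w_star.copy()                      # teaching input
y = float(w_star @ w_star) + lam       # teaching label, deliberately "incorrect"

# Closed-form ridge solution for the one-example dataset:
X = x.reshape(1, -1)
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(X.size), X.T @ np.array([y]))

print(w_hat)                 # ~[ 2. -1.  0.5] -- recovers w_star (up to floating point)
print(y, float(w_star @ x))  # 5.35 vs 5.25   -- the teaching label is off by lam

The extra $\lambda$ in the constructed label is precisely what offsets the shrinkage the regularizer would otherwise induce, which is why, with this one-example construction, the "incorrect" label is exactly what is needed to hit the target parameter exactly.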
Personally, I can empathize with both sides of this issue, so I am inclined to respect the authors' choice of terminology. My overall opinion is that these results are nice to have, as they provide a kind of ideal-case analysis of teaching for this family of learning algorithms. It would probably require significant future work to extend this to a realistic model of teaching in practice (especially by a human teacher). But in the meantime, the quantity itself may turn out to be interesting for other reasons, as it seems to touch on some important feature of the learning algorithms. The proofs appear sound to me.

Minor comments:

The paper is very well written in general. However, there are numerous grammatical mistakes, so it would benefit from a careful read-through to correct these.

One natural question is whether the teaching dimension would increase significantly if the examples (x, y) were required to be consistent with the target function. (This only affects the ridge regression example in the present work.)

The early parts of the paper were a little confusing to me at first, since equation (1) is stated for the homogeneous case, but there are mentions of the inhomogeneous case (and the usual "reduction to homogeneous" by introducing a constant virtual feature doesn't work here for upper bounds, since the teacher gets to control all of the features). This all gets cleared up later when the inhomogeneous case is explicitly introduced, but perhaps a forward reference could be added.

In Section 4, the teaching dimensions for the inhomogeneous case are stated as being equal to 2. However, I don't see why this follows from the corollaries of the previous section. Wouldn't they only place it in a range between 1 and 2, due to the rounding differences? I suspect there's an easy direct argument that both must be at least 2, but I don't see that argument being made in the present manuscript.

Section 5 claims previous works "assume little extra knowledge" beyond consistency with the training samples. However, I know there has been some work on more favorable families of learners for teaching. A good entry point to that literature might be Zilles, Lange, Holte, & Zinkevich (2011) (which is already cited in the manuscript).

It might be worth adding some discussion of sample compression schemes, which seem highly relevant to this topic. These are a family of learning algorithms that produce a classifier consistent with a given data set, but only need to use a small number of the samples in order to do so (e.g., the support vectors, in the case of SVM). It seems this work, and especially Section 4, is essentially studying a special kind of sample compression scheme, with the only exceptions being that (1) the entire instance space is used instead of a finite data set, and (2) the labels in the teaching set are not required to be the target labels.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

The paper considers the Teaching Dimension (TD) --- the minimal size of a sample needed to recover a given predictor when a fixed algorithm is trained on the sample. Apparently, TD has a long history of research and is a rather important topic in computational learning theory.
The authors extend the definition of TD to learning algorithms other than ERMs (version-space algorithms) and provide an extensive theoretical analysis of this novel definition (Section 3) when applied to learning algorithms minimizing the regularized empirical risk (MRER). The authors further relax the notion of TD, consider the problem of identifying the correct classification boundary, and provide several results in this direction. The main results include (a) three different lower bounds for the TD of MRER algorithms (Section 3.1), (b) matching upper bounds for SVM, ridge regression, and logistic regression (both homogeneous and inhomogeneous) (Sections 3.2, 3.3), and (c) upper bounds on the TD for the same set of three learning algorithms in the relaxed problem where the student only needs to identify the correct decision boundary (Section 4). Importantly, all the upper bounds on the TD are presented together with the corresponding teaching samples. In other words, the results are constructive.

Clarity - Justification:

The paper is extremely well and clearly written. The text is very well polished and contains very few typos. The exposition is easy to follow. The authors discuss all the definitions and provide very intuitive examples. Moreover, the authors present a lot of different examples demonstrating the principles and concepts being discussed (for instance, Section 6 considers many different situations illustrating different values of TD). I think this makes the paper especially easy to follow. Unfortunately, all the proofs appear only in the supplementary material. The authors do not discuss the ideas behind the proofs in the main part.

Significance - Justification:

I am not an expert in this particular field of computational learning theory, where, apparently, TD has its origins. Having a background in statistical learning theory, I am curious whether TD has any obvious connections to the usual statistical setting of learning theory. In other words, does TD let us control the consistency of certain algorithms or provide exact risk bounds? This topic was ignored in the paper, but I think the paper could be greatly improved if it were to contain these discussions (there is a discussion of the relation to VC-dimension, but I don't find it too convincing). Given the long history and importance of TD, I think the presented results are indeed interesting, novel, and non-trivial. The paper provides an analysis in which the novel upper bounds exactly match the novel lower bounds. I find this type of work very inspiring and theoretically elegant. The paper also extends the notion of TD to classifiers minimizing regularized empirical risk --- a class containing many algorithms commonly used in practice. This is especially important considering that all the previous works considered TD only in the context of finite or countable function classes.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

Summarizing everything said above, I think the paper contains many novel, non-trivial, and useful results. I also think it is clearly written. My only concern is regarding motivation: I am not very familiar with the topic of TD (or with the literature referenced in Section 5), and the implications of these new results are not obvious to me. I think the paper is above the acceptance threshold.
======================== Minor comments ========================

(1) Is there a link between version-space algorithms and ERMs? Are they the same? Perhaps the authors could make this connection explicit for readers coming from a statistical learning background.

(2) It would be nice to include a couple of words on the relation between objective (1) and regularization in an RKHS. Does the Mahalanobis distance correspond to any particular choice of kernel?

(3) It would also be nice to discuss the assumptions under which the optimization program (1) has a unique solution.

(4) As pointed out above, it would be nice to provide a discussion of the relation between the TD of a learning algorithm and the popular setting of statistical learning theory: do we get consistency / risk bounds by knowing the values of TD?

(5) When the authors say "target model", do they mean that the outputs $y$ are produced as $y = \langle x, w^*\rangle$, where $w^*$ is "the target model"? This was not made precise in the paper.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

This paper considers an algorithm-dependent extension of the teaching dimension to continuous concept spaces. It then provides an extensive analysis of the teaching dimension for SVM, ridge regression, and logistic regression. Some illustrative simulations are also provided.

Clarity - Justification:

This has been one of the better-written papers that I've reviewed for ICML; the authors clearly put a lot of effort into the writing. My only suggestion is to consider including more figures if it seems appropriate to you, perhaps to illustrate one of the trickier proofs or the various definitions (this would be more icing on the cake than a necessary step).

Significance - Justification:

[If I could have given "average" as a rating, I would have given that.] As an extension of teaching dimension, the results are interesting, but the paper would have been more compelling if more arguments had been made as to why this extension is the "right" extension of teaching dimension. For instance, there are tie-ins between teaching dimension and dataset poisoning --- does this definition capture those tie-ins? The fact that the concept is algorithm-dependent is perhaps a bit troubling and makes one wonder if there is an algorithm-independent definition that would be conceptually cleaner. A related thought: the definition of teaching dimension given appears to allow one to essentially "force" an algorithm to have the right concept, sometimes in ways that seem degenerate, in that the algorithm doesn't really have a justifiable reason to believe in the concept it learned (i.e., there are many other concepts consistent with the data). I suppose this takes us back in the direction of the original teaching dimension definition for the discrete case. Perhaps my main question is why we don't go with the "obvious" generalization of teaching dimension where we define a metric d(x, y) and call the "teaching dimension at radius r" the smallest training set size such that the set of concepts consistent with the training set fits within a sphere of radius r. I'd be curious what the authors think about such a definition. (A tentative formalization of what I have in mind is sketched at the end of this review.)

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

The paper is well written, and the results are interesting, but perhaps the definitions could have been justified more strongly. Still, if we take the definitions at face value, the technical results as they stand are thorough.
I think there is enough in this paper that it is worth publishing.

=====
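Addendum to Review #3: to make the "teaching dimension at radius r" suggestion above concrete, here is one tentative formalization (my own phrasing as reviewer, not a definition taken from the paper). Fix a metric $d$ on the concept space and let $\mathrm{VS}(D)$ denote the set of concepts the learner considers consistent with a teaching set $D$; then one could define

$$\mathrm{TD}_r(c^*) \;=\; \min\bigl\{\, |D| \;:\; \mathrm{VS}(D) \subseteq \{\, c : d(c, c^*) \le r \,\} \,\bigr\}.$$

Taking $r = 0$, so that the consistent set must collapse to $\{c^*\}$, recovers the classical (discrete) teaching dimension. Such a definition would be algorithm-independent, at the cost of requiring a choice of metric; contrasting it with the paper's algorithm-specific quantity is exactly the kind of discussion I would like to see.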