Paper ID: 1030
Title: On the Power and Limits of Distance-Based Learning

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper poses the problem of learning a mapping from one metric space to another (such as a translation of a sentence from one language to another) under the assumption that distances are approximately preserved by the mapping. More specifically, the paper considers a setting in which one has access to two "experts" A_0 and A_1, such that the "optimal selector" among these has low error, where the optimal selector is defined to choose the better of A_0(x) and A_1(x) for each x. The paper then proves lower bounds for such a setting. (Note that such a lower bound would be trivial in the "classic" experts setting, since you could just have A_0(x)=0 and A_1(x)=1 for all x and construct any function at all from these; what makes this setting more challenging is the requirement that the mapping have low distortion.)

Clarity - Justification:
The writing is very clear. The proof idea given in Section 3.1 is very clear, and there is even a helpful diagram.

Significance - Justification:
The setup and high-level question posed are very interesting, though it was a bit disappointing that after the initial buildup the paper gave only negative results (see detailed comments).

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Positives: The setup and high-level question posed are very interesting. Page 1 was great and made me eager to read more.

Negatives: After posing such a great question, the paper then gives only negative results. I was hoping for an exciting positive result. Also, the addition of "experts" seems to come a bit out of the blue.

Overall: Overall I am positively inclined. The paper raises an interesting problem and is well written, so I think it is likely to spark future work and would therefore make a good addition to the conference.
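The triviality observation in the summary above can be made concrete with a short sketch (all names here are hypothetical, for illustration only): with two constant experts and no distortion constraint, a selector realizes any binary target exactly, which is why the low-distortion requirement is what gives the problem its content.

```python
def expert_0(x):
    # Constant expert: always predicts 0.
    return 0

def expert_1(x):
    # Constant expert: always predicts 1.
    return 1

def make_selector(target):
    # The "optimal selector" simply picks, for each point, whichever
    # expert agrees with the target there.
    def select(x):
        return 1 if target(x) == 1 else 0
    return select

def predict(x, selector):
    # Follow the selected expert on x.
    return expert_1(x) if selector(x) == 1 else expert_0(x)

# An arbitrary {0,1}-valued target is realized exactly by selecting
# between the two constant experts, point by point.
target = lambda x: x % 3 == 0
selector = make_selector(target)
assert all(predict(x, selector) == int(target(x)) for x in range(100))
```

The sketch shows only that the selector class is fully expressive in the unconstrained experts setting, not anything about the paper's low-distortion setting.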
Other comments:
- Does a result like Theorem 1 hold if you assume that the optimal selector has *zero* generalization error? Is this an open question?
- The paper should expand footnote 1 and motivate the use of "experts" in the introduction. Otherwise, they just come out of the blue.
- The paper should be clear that the goal is not just to do "nearly as well as the best of the two experts," which would of course be trivial.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper provides some "no free lunch" style results for structured learning. It shows that for certain metrics over the label space, such as edit distance over binary sequences, PAC-style learning of the class of all "low distortion mappings" requires sample sizes that are exponential in the label-space dimension. Furthermore, the sample-complexity hardness still holds when the learner has access to two labeling rules such that for every instance the target labeling agrees with one of them.

Clarity - Justification:
The paper is generally very well written. My only suggestion to improve the presentation is that the authors should compare their results to known no-free-lunch results about learning classes of Lipschitz functions.

Significance - Justification:
Bearing in mind that similar results are known for classes of Lipschitz functions, the results of this paper are rather expected. However, the current results do not follow from the known results on Lipschitz classes and seem to require some novel technical tricks. (For hardness results on classes of Lipschitz functions, see, for the binary-label case, Theorem 19.4 in the textbook "Understanding Machine Learning" by Shalev-Shwartz and Ben-David, and, for learning similar classes with real-valued outputs, the known lower bounds in, for example, Kearns and Schapire, "Efficient distribution-free learning of probabilistic concepts," JCSS 1994.)
The relaxation of the setup by allowing access to two "experts" does not seem very meaningful, as the correct classification is allowed to alternate arbitrarily (over the instance space) between the predictions of those experts.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The authors should explicitly compare their results to the very closely related known lower bounds for learning Lipschitz functions (mentioned above). One should note that the "low distortion" requirement analyzed in this paper is more demanding than the Lipschitzness requirement: on top of requiring that similar instances get similar labels, it also requires that dissimilar instances get dissimilar labels. Therefore, lower bounds in the context of this paper are stronger than lower bounds for classes of Lipschitz functions.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors study low-distortion embeddings between discrete metric spaces (with edit distance as the metric) and provide lower bounds on the sample complexity (number of queries made) in the "learning with expert advice" model. Specifically, they show that for combinatorial spaces (such as the Hamming cube) with 2^n elements, one cannot come up with a low-distortion embedding knowing only the true embedding of poly(n) elements and having access to experts.

Clarity - Justification:
While the manuscript is generally clear, there are parts that are very confusing. Here are a few specific examples:
+ [lines 83-86] "... r is an isometry ... Thus, the example is compatible with the no-free-lunch theorem." I don't see how the former implies the latter.
+ [line 149] "Despite the above message, our work is still theoretical." I don't understand how the message (lines 139-147) would lead to the conclusion that theoretical work cannot be done.
+ [lines 580-587, Definition 5] The definition of a cylindric metric space is very unclear....
How exactly is an element mapped to "[N^2] x [N]"? And what exactly is the pair (x,y), (x',y') \in [N^2] x [N] (i.e., what exactly is the [N^2] component and what is the [N] component)?

Significance - Justification:
(before author feedback) While the authors address an interesting problem, I believe that the final conclusion (that knowledge from only a polynomial number of queries does not suffice to find a good embedding of exponentially many elements) is obvious. See detailed comments below.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
(before author feedback) The authors tackle an interesting theoretical problem: finding the query complexity of low-distortion embeddings in combinatorial spaces in the "learning with experts" model. I appreciate the effort put into the analysis, but I simply cannot understand why this result is not obvious. The issue is not specific to low-distortion embeddings; the exact same issue occurs with something as simple as classification or regression. Say we want to classify 2^n elements in the same "learning with experts" framework. That is, we are given two classifiers A_0 and A_1, and the perfect classifier r is the output of either A_0 or A_1 on each element (i.e., the perfect selector S has zero error). Then it is clear that by making only poly(n) queries to r, one cannot obtain a good selector S. So we get an "impossibility" result for classification. The catch is that the individual A_0 and A_1 don't have "low distortion" (i.e., low error). To address the restriction on A_0 and A_1, one can generalize the classification setting to a regression setting and achieve a similar "impossibility" result under exactly the same conditions as Theorems 1 and 2. In light of this, I believe that the claims made by the authors do not shed any light on the difficulty of the specific problem of "distance-based learning."
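The counting argument above can be checked with a small simulation (a sketch, with hypothetical parameter choices): if the perfect selector assigns an arbitrary, uniformly random bit to each of the 2^n points, then a learner that sees the selector on only poly(n) queried points is left guessing on the rest, where it errs on about half of the unseen points.

```python
import random

random.seed(0)

n = 12
domain_size = 2 ** n      # 4096 points in the instance space
num_queries = n ** 2      # a poly(n) query budget: 144 queries

# Worst-case selector: an independent, uniformly random bit per point,
# saying which of A_0 / A_1 is correct there.
true_selector = [random.randint(0, 1) for _ in range(domain_size)]

# The learner observes the selector only on a poly(n)-sized sample...
queried = set(random.sample(range(domain_size), num_queries))

# ...and must guess on everything else. Guessing a fixed bit (here 0)
# errs on roughly half of the unseen points, and against a uniformly
# random selector no guessing rule does better in expectation.
unseen = [x for x in range(domain_size) if x not in queried]
errors = sum(true_selector[x] for x in unseen)
error_rate = errors / len(unseen)
print(round(error_rate, 2))  # close to 0.5
```

This only illustrates the generic selector-learning obstacle raised in the review; it does not model the low-distortion structure of the paper's setting.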
This is a general and obvious phenomenon of selecting between two "experts" for each element. I hope the authors can clarify the issue I raised in their response.

Other issues:
+ The authors mention (lines 302-308) that along with the "impossibility result", "a number of positive [results] ... were discovered", but none of them are discussed in the text.
+ From the text, it is not clear how this work helps us understand the power / expressibility of distance-based learning. In light of this, I would recommend that the authors consider changing the title of the paper.

Minor suggestions:
+ It would be instructive to compare the model introduced in this work with the regret framework, i.e., how does the quality of the selector fare compared to the best single expert (A_0 or A_1)?

=====