Paper ID: 1306
Title: Learning Mixtures of Plackett-Luce Models

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper addresses the question of identifiability of mixtures of PL models and provides partial answers to specific aspects of this question. Moreover, the authors propose a GMM algorithm for estimating the PL mixture as an alternative to an existing EMM algorithm.

Clarity - Justification: see detailed comments

Significance - Justification: see detailed comments

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The problem addressed by the authors is an interesting one, and the formal results presented are clearly interesting.

Comments:

The motivating examples given in the first paragraph of the introduction are not really appropriate, simply because these are not examples in which the set of objects to be ranked is a fixed, finite set. For example, in ranking websites or in information retrieval, each problem typically involves a new set of objects (websites/documents) that have not been seen before, and generalization is done via features of objects. More appropriate would be other ranking problems for which the set of items to be ranked is indeed finite and remains unchanged, such as label ranking. For example, the PL model has been used for label ranking in "Label Ranking Methods based on the Plackett-Luce Model" by Cheng et al. (ICML 2010).

Occasionally, the paper is a bit sloppy. For example, it is not correct to say that the PL model is a natural generalization of logistic regression, because the former is a probabilistic model while the latter is a method for binary classification.

In practice, observed rankings are often incomplete, i.e., they do not involve all m items. One of the nice features of the PL model is that it easily extends to this case, i.e., it allows for a very simple computation of marginals of the complete distribution (see the sketch below). Since the authors seem to assume complete rankings as observations throughout the paper, it would be interesting to have a remark on generalizations to the incomplete case.
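To make the marginalization point concrete: under a PL model with positive score vector theta, the probability of observing a given top-j prefix is the usual sequential-choice product truncated after j steps, so incomplete observations require no extra work. A minimal sketch, with function and variable names that are my own rather than the paper's:

```python
import numpy as np

def pl_prob_prefix(theta, prefix):
    """Probability that a Plackett-Luce model with score vector `theta`
    ranks the items in `prefix` first, in that order.  For a full ranking
    this is the usual PL likelihood; for a proper prefix it equals the
    marginal of the complete distribution, summed over all completions."""
    theta = np.asarray(theta, dtype=float)
    remaining = theta.sum()
    prob = 1.0
    for item in prefix:
        prob *= theta[item] / remaining  # choose `item` among those left
        remaining -= theta[item]         # remove it from the choice set
    return prob

# Example with m = 4 items: marginal probability that item 2 is ranked
# first and item 0 second, regardless of how items 1 and 3 are ordered.
theta = [0.1, 0.2, 0.5, 0.2]
print(pl_prob_prefix(theta, [2, 0]))
```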
If I understand correctly, the GMM approach has been worked out only for k = 2 and m = 4. How does it generalize to other mixtures and longer rankings? Since the experiments are limited to this setting, too, they are not very conclusive.

As for the running time, which seems to be the advantage of the authors' method, one should of course note that the running time of EMM strongly depends on the threshold used for testing convergence. Couldn't the efficiency of EMM be improved at the cost of a slightly worse accuracy (which might still be as good as that of GMM)?

=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper addresses the problem of learning a mixture of k Plackett-Luce ranking models (over m items). Plackett-Luce models are popular probabilistic models for generating rankings over m items (based on a score vector \theta). These are very popular models in various fields, and learning a mixture of these ranking models is a well-motivated question considering the heterogeneity of populations. This paper gives three main results (none of which were known earlier):
1. Instances showing non-identifiability of this mixture model when k > (m-1)/2.
2. Generic identifiability of these mixture models even when k <= (m/2 - 1)! (factorial).
3. A method-of-moments based algorithm (no efficiency guarantees) for learning mixtures of 2 PL models.
While the techniques used are not super surprising, I think these answer the first questions that one would ask about learning mixtures of these ranking models. Hence, this paper should be accepted.

Clarity - Justification:
The paper is generally well written. However, I have a couple of comments:
1. The paper builds on a lot of existing work (particularly Allman et al.), and it would be useful to reproduce the statement of the result (Corollary 3) used here, and explain why it applies.
2. Section 4: the approach based on the generalized method of moments needs to be explained better (and should involve less hand-waving). For instance, the authors should give a mathematical definition of the functions g (an illustrative sketch appears after this review).

Significance - Justification:
Mixtures of ranking models are studied in different communities (ML, statistics, information retrieval, social choice, etc.), and there is a lot of work on learning such mixtures. Plackett-Luce is one such ranking model that is widely used (along with the Mallows and Bradley-Terry models). This paper answers the primary questions one should address before even hoping to learn mixtures of PL models from data: are they even identifiable, and for what settings of the parameters do they become non-identifiable? While theoretical guarantees have been generally lacking, in the last couple of years there has been work on proving identifiability and obtaining algorithms with provable guarantees for learning these models, using method-of-moments based approaches. This paper continues this line of work and provides such a theoretical understanding for mixtures of PL models. I particularly like their clear and elegant characterization of identifiability in terms of a simple rank condition on a certain natural matrix (Lemma 3). I think this is an important work in the context of ranking models that should lead to future work.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
1. I think there is other recent work about learning mixtures of ranking models that should be referenced. One particular work is by Chierichetti, Kumar, Dasgupta, and Lattanzi on learning mixtures of Mallows models --- these results also show non-identifiability for k being large in terms of n, and they give algorithmic results when the central permutations are far apart. Also, it is important to mention that Chierichetti et al. and Awasthi et al. both give algorithms that run in polynomial time (and with polynomially many samples), and do not just give identifiability results. While these are for the Mallows model and not for the PL model, there are many similarities, and they are more closely related to this work than Gaussian mixture models.
2. I think it is good to mention the technical similarities to other work that uses the method-of-moments approach via low-rank tensor decompositions (as in Allman et al.). Apart from the work of Allman et al., which focused on identifiability, Awasthi et al. also define an appropriate notion of "moments" for rankings and obtain learning guarantees using tensor decompositions (and the method of moments).
3. Notation: It is confusing to use superscripts like 2 in \alpha^2_k, where 2 is an index and not a square! Please use brackets on the superscripts, e.g., \alpha^{(2)}_k and \theta^{(r)}_m.
4. Typo in equation 4: \theta vs. \theta'.
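Since this review asks for a concrete definition of the moment functions g, here is a hedged, self-contained sketch of the generic GMM recipe for a 2-component PL mixture over m = 4 alternatives. The particular ranking events used as moments below (item i ranked first; item i ranked immediately before item j) are an illustrative choice, not necessarily the conditions used in the paper, and all names are mine:

```python
import numpy as np
from itertools import permutations
from scipy.optimize import minimize

M = 4  # number of alternatives; small enough to enumerate all M! rankings

def pl_prob(theta, ranking):
    """Plackett-Luce likelihood of a full ranking given score vector theta."""
    remaining, p = theta.sum(), 1.0
    for item in ranking:
        p *= theta[item] / remaining
        remaining -= theta[item]
    return p

def g(ranking):
    """Moment function g(d): indicators of the chosen ranking events,
    first-place events followed by adjacent-pair events."""
    vec = np.zeros(M + M * M)
    vec[ranking[0]] = 1.0                     # event: ranking[0] is first
    for a, b in zip(ranking, ranking[1:]):
        vec[M + a * M + b] = 1.0              # event: a immediately before b
    return vec

def unpack(params):
    """Map unconstrained params to (mixing weight, theta^(1), theta^(2));
    a sigmoid and softmaxes keep everything positive and normalized."""
    alpha = 1.0 / (1.0 + np.exp(-params[0]))
    t1 = np.exp(params[1:1 + M]); t1 = t1 / t1.sum()
    t2 = np.exp(params[1 + M:1 + 2 * M]); t2 = t2 / t2.sum()
    return alpha, t1, t2

def model_moments(params):
    """E_model[g(d)] under the mixture, by brute-force enumeration
    (fine for M = 4; larger M would need closed-form event marginals)."""
    alpha, t1, t2 = unpack(params)
    mom = np.zeros(M + M * M)
    for r in permutations(range(M)):
        mom += (alpha * pl_prob(t1, r) + (1 - alpha) * pl_prob(t2, r)) * g(r)
    return mom

def gmm_estimate(rankings):
    """Minimize the GMM objective ||mean_n g(d_n) - E_model[g]||^2,
    i.e. the basic case with the identity weighting matrix."""
    emp = np.mean([g(r) for r in rankings], axis=0)
    obj = lambda p: float(np.sum((emp - model_moments(p)) ** 2))
    res = minimize(obj, x0=np.random.randn(1 + 2 * M), method="Nelder-Mead")
    return unpack(res.x)
```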
=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper studies the problem of learning a mixture of Plackett-Luce models. It shows that a mixture of Plackett-Luce models is not identifiable if the number of components in the mixture, k, satisfies k >= (m+1)/2, where m is the number of alternatives being ranked. On the positive side, it shows that a mixture of Plackett-Luce models is identifiable if k <= floor((m-2)/2)!. The paper also proposes an algorithm based on the generalized method of moments and shows that this algorithm provides a consistent estimate, and that it can be empirically much faster than a previously known algorithm, at the cost of a somewhat larger mean squared error.

Clarity - Justification:
The paper is overall well written. The definitions and main claims are clearly stated.

Significance - Justification:
Learning of mixtures has been studied for Gaussian distributions and for the Mallows ranking model. The paper provides new results for mixtures of Plackett-Luce ranking models. The Plackett-Luce ranking model is a canonical ranking model in theory and practice - e.g., in practice many rating systems are based on the so-called Bradley-Terry model, which is a special case of the Plackett-Luce ranking model.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Overall, I like the problem, the results, and the presentation of the paper. I have only a couple of minor comments.

Line 308: it would be good to provide some explanation of why we consider 2k theta parameters.

Section 4: the algorithm based on the generalized method of moments could be defined somewhat more clearly.

Line 622 refers to the concept of a moment condition, which could be defined more clearly (a standard statement of this notion is given below).
=====
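For reference on that last point: in the standard generalized-method-of-moments formulation (textbook usage, not quoted from the paper under review), a moment condition is a vector-valued function g of a data point d and parameter \theta whose expectation vanishes exactly at the true parameter, and the estimator drives the empirical moments toward zero in a weighted norm:

```latex
% Moment condition: the expected value of g vanishes at the true theta*.
\mathbb{E}_{d \sim P_{\theta^{*}}}\left[\, g(d, \theta^{*}) \,\right] = 0

% GMM estimator over n observed rankings d_1, ..., d_n, with a positive
% semidefinite weighting matrix W (the identity in the simplest case):
\hat{\theta}_{n} \in \operatorname*{arg\,min}_{\theta}
  \left( \frac{1}{n} \sum_{i=1}^{n} g(d_{i}, \theta) \right)^{\!\top}
  W \,
  \left( \frac{1}{n} \sum_{i=1}^{n} g(d_{i}, \theta) \right)
```

Consistency of such an estimator hinges on the moment conditions identifying \theta^{*} uniquely, which is where identifiability results of the kind this paper proves come in.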