Paper ID: 253 Title: Parameter Estimation for Generalized Thurstone Choice Models

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): The paper considers the problem of parameter estimation for the generalized Thurstone model. Lower and upper bounds are given for the accuracy of the maximum likelihood estimator. The bounds for the pairwise case depend on the connectivity of the pairwise comparison graph. The bounds for the more general case, where the cardinality of the comparison sets is greater than 2, depend on the set of distinct comparison-set cardinalities. These results are also extended to the classification setting, where the goal is to decide for each item whether its skill parameter belongs to the upper half. The theoretical findings are supported by numerical experiments.

Clarity - Justification: The notation was not easy to follow. A table containing the key notation would help, because I spent significant time looking up notation. Some paragraphs should also be improved; see the detailed comments. I think remarks should be used instead of footnotes.

Significance - Justification: The bounds are interesting; however, they are hard to understand because the constants involved are hard to interpret intuitively.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): Comments:

Regarding p.1, lines 73-77: I think there is an inconsistency in the notation: $\hat{\theta}_i$ is a real number, not both a real number and a cdf. The whole paragraph should be rephrased.

Condition G2 makes perfect sense in Theorem 1, because if there are two subsets of the items, say A and B, such that no item of A is compared with any item of B, then we cannot hope for good estimation of the parameters uniformly over every component. What I do not see is why G2 is needed for the concavity of the pairwise likelihood (this is claimed just before Theorem 1). Moreover, G1 is not used in any statement.

A1, A2 and A3 define the constants on which the required number of comparison sets in Theorem 3 depends. These constants characterize some properties of the distribution of the skills. Some discussion might be added on why these constants are used and what they mean intuitively.

The lower bound given in Theorem 3 is of order $n/m$, whereas the upper bound is of order $(n \log n)/m$. It might be worth discussing the reason for this gap between the lower and upper bounds.

In the experiments, the Fiedler value of a real dataset is computed (Subsection 6.2). This is indeed an interesting finding but, in my opinion, it is not strongly related to the topic of the paper, which is the accuracy of the parameter estimation. This part might be moved to the appendix and, instead, some discussion of the constants used in Theorem 3 might be added to the paper.

As far as I understand, the discussion in Section 5 concerns the unbiased case.
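For reference, the Fiedler value computed in Subsection 6.2 is the second-smallest eigenvalue of the Laplacian of the pairwise comparison graph. A minimal sketch of its computation, with hypothetical comparison data (an illustration only, not the authors' code):

import numpy as np

def fiedler_value(n, comparisons):
    """Fiedler value (algebraic connectivity) of the comparison graph.

    n           -- number of items
    comparisons -- iterable of (i, j) pairs of items that were compared
    """
    A = np.zeros((n, n))
    for i, j in comparisons:
        A[i, j] += 1.0            # edge weight = number of comparisons of the pair
        A[j, i] += 1.0
    L = np.diag(A.sum(axis=1)) - A    # graph Laplacian L = D - A
    return np.linalg.eigvalsh(L)[1]   # eigenvalues ascend; the first is always 0

# A path on 4 items is connected, but only barely: its small Fiedler value
# corresponds, per the bounds discussed above, to a weaker error guarantee.
print(fiedler_value(4, [(0, 1), (1, 2), (2, 3)]))  # ~0.586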
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): This paper considers the generalized Thurstone model with top-1 choice data. An ML estimator is described for this case. The main contribution of the paper is an analysis of the properties of the ML estimator, providing conditions that guarantee convergence together with a bound on the mean squared error. The bound is also connected to the cardinalities of the comparison sets. Empirical results are shown to validate the claims.

Clarity - Justification: This paper could be presented with more clarity; the generalized Thurstone model is used without a proper reference. It is not clear whether this paper proposes the generalized Thurstone model or merely uses it. The goal of the paper is not described clearly: the model and the estimator already exist, and this paper studies the properties of ML estimation. The paper misses a large body of recent work by A. Ammar, D. Shah, H. Azari, and S. Oh on parameter estimation and error bounds for Random Utility Models and rank-data models.

Significance - Justification: The paper discusses properties of an existing estimator and describes a setting for computing error bounds for it. It is not clear how these results would help the estimation problem in general, and this approach is not compared with any competitive method or algorithm.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): The paper proposes some new results in the form of error bounds for ML estimation. However, I have two major concerns with this paper:

1. Presentation: The authors treat top-1 data as the general case, which is not true; the generalized Thurstone model is more complex than just top-1 data. The application scope is limited in this case, and this should be addressed.

2. Practicality of the results: The results and bounds are an addition to the current literature. However, I am not sure how they would help us model rank data. I expect some comparisons, or at least an illustration of the significance of these results, against other rank aggregation methods on the real data. The current empirical results validate the claims, but they do not show practical significance.

Minor: Line 346: of of -> of

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): The paper extends an earlier study by Hajek-Oh-Xu 2014 on maximum likelihood estimators for Plackett-Luce models to generalized Thurstone models (also called Thurstone models or Random Utility Models in the literature, e.g. Hajek et al. 2014). The paper presents upper bounds on the mean squared error of the MLE for such general models, which in particular depend inversely on the Fiedler value of the pair-weight graph, as well as a lower bound that qualitatively matches the corresponding bounds given in Hajek et al. 2014 for general Thurstone models. The paper also analyzes the top-n/2 partial-ordering case, showing that the upper bound of a Borda-count-type estimator meets the lower bound and is thus tight in this particular case. Experimental studies with simulated and real-world data investigate the dependence of the mean squared error on the comparison-set cardinality under different models and on the Fiedler values of pair-weight graphs.

Clarity - Justification: The paper is generally well written. Yet the authors should keep the terminology as consistent as possible, at least with the references they cite. For example, the generalized Thurstone models in this paper are essentially the Thurstone model or Random Utility Model (RUM) of Hajek et al. 2014, Definition 2, where the utility $U_i$ can be given by $p_{i,S}(\theta)$, which depends only on the difference between the score $\theta_i$ and the others in the set $S$.

Significance - Justification: The upper bounds for the MLE of generalized Thurstone models have never appeared explicitly in the literature; they extend the earlier work of Hajek-Oh-Xu 2014. The authors should mention that the lower bounds in Hajek et al. 2014 actually apply to the general Thurstone model (RUM), whereas the upper bounds there are restricted to the MLE for the Plackett-Luce model. The tightness of the Borda-count estimator in top-n/2 ranking is new, to my knowledge.
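To make the Borda-count-type estimator mentioned above concrete, here is a minimal sketch for the top-n/2 classification task (the data format and the win-rate scoring are my own illustration and may differ in details from the authors' estimator):

from collections import defaultdict

def borda_top_half(n, observations):
    """observations -- iterable of (S_t, y_t) with the top-1 choice y_t in S_t."""
    wins = defaultdict(int)
    appearances = defaultdict(int)
    for S, y in observations:
        wins[y] += 1
        for i in S:
            appearances[i] += 1
    # Score each item by its empirical win rate, then classify the top half.
    score = [wins[i] / max(appearances[i], 1) for i in range(n)]
    ranked = sorted(range(n), key=lambda i: score[i], reverse=True)
    return set(ranked[: n // 2])

# Six hypothetical top-1 observations over four items:
obs = [({0, 1}, 0), ({0, 2}, 0), ({1, 2, 3}, 1),
       ({2, 3}, 2), ({0, 3}, 0), ({1, 3}, 1)]
print(borda_top_half(4, obs))  # -> {0, 1}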
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): As the authors point out, the Fiedler value of the pair-weight graph is important for controlling the mean squared error rate. Perhaps the following references on controlling Fiedler values are related to your discussion:

(1) A greedy algorithm to find expander graphs with large Fiedler values: B. Osting, C. Brune, S. Osher, Enhanced statistical rankings via targeted data collection, International Conference on Machine Learning, 2013, pp. 489-497.

(2) The relations between the Fiedler values of large random graphs and of greedy expander graphs: B. Osting, J. Xiong, Q. Xu, Y. Yao, Analysis of Crowdsourced Sampling Strategies for HodgeRank with Sparse Random Graphs, Applied and Computational Harmonic Analysis, 2016.

===== Review #4 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): The paper considers the generalized Thurstone model (which has an unknown parameter vector theta and a known distribution F) for comparing n items, where the number of data points is m and each data point y_t reveals the top choice in a set S_t. The sizes of the sets {S_t, t in [1, m]} need not be the same. In this setting, the authors give upper bounds for the ML estimate of theta in terms of the distribution F. They also show that for a subset of the parameter space, in which the coordinates of the vector theta are allowed to take two values, the MLE performs optimally.

Clarity - Justification: The problem is clearly formulated, and the results are well explained. There is one minor comment (see (ii) in the comments section). I did not go through the proofs in the supplementary file.

Significance - Justification: The paper considers an important problem and gives bounds on the performance of the MLE. The ML algorithm assumes that the distribution F is known and is usually computationally expensive, but it can still be seen as the benchmark algorithm against which other algorithms can be compared. There are, however, some concerns about the implications of the results obtained, and these are explained in the comments section.
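To fix the setting of the summary above: a minimal generative sketch of how a single data point y_t arises in this model, using Gumbel noise (the double-exponential special case, which recovers the Luce choice model). This is a hypothetical illustration, not the authors' code:

import numpy as np

rng = np.random.default_rng(0)

def sample_top1(theta, S):
    """Return the top choice y_t from the comparison set S_t."""
    S = list(S)
    utilities = theta[S] + rng.gumbel(size=len(S))  # U_i = theta_i + eps_i, eps_i ~ F i.i.d.
    return S[int(np.argmax(utilities))]

theta = np.array([1.0, 0.2, -0.3, -0.9])  # skill parameters (unknown in practice)
print(sample_top1(theta, [0, 2, 3]))      # one data point: the winner of S_t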
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

(i) Can the authors comment on the numbers A_{F,b}, B_{F,b} and C_{F,b} of Theorem 2 for a couple of distributions other than the double exponential? Is it always the case that they can be chosen free of K (the largest comparison batch size), as done in footnote 4?

(ii) Conditions on the distribution function F or on the density function f would seem more natural. By contrast, conditions on the function p_k(x) are harder to check and seem somewhat artificial. Can the authors give reasonable sufficient conditions on F or f under which the assumptions on p_k(x) hold? A similar comment applies to Theorem 4.

(iii) How is the choice of the weights (w(1), ..., w(m)) related to the ML algorithm?

(iv) The letter m has been over-used: it denotes the entries of a matrix m_{i,j} on page 2 and also the number of samples on page 1. I would suggest changing one of the two notations.

(v) The letter w has also been over-used: w(.) is a weight function, and w_{i,j} is also a matrix.

(vi) In terms of the performance of the ML estimator, small comparisons seem to be better than large comparisons (i.e., small k is better than large k), as \gamma_{F,k} seems to decrease with k for most choices of F. Is there any intuitive reasoning for this? Wouldn't it be better to observe the whole ranking every time?

=====