Paper ID: 600
Title: False Discovery Rate Control and Statistical Quality Assessment of Annotators in Crowdsourced Ranking

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors extend the knockoff filter of Barber & Candes 2015 (a method for variable selection with guarantees on the false discovery rate, i.e. the proportion of variables incorrectly added to the model) to a case where there are both dense and sparse parameters. They apply it to the detection of "position-biased annotators" in crowdsourcing (annotators who favor one position, e.g. left or right, when presented with pairs to compare). They also advocate the use of Linearized Bregman Iteration (LBI) rather than LASSO estimators to compute the regularization path on which they apply the filter. Qualitative experiments on several datasets illustrate the results of their approach.

Clarity - Justification:
The paper is well-written and well-motivated.

Significance - Justification:
The problem addressed by the authors is interesting. However, the results seem incremental given the work of Barber & Candes.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Even though the problem addressed by the authors is interesting from a practical perspective, I think the paper has a few weaknesses:
(1) The authors assume that the response variable is Gaussian. This assumption is rather debatable when considering pairwise comparisons, and it somewhat weakens the theoretical part, since the algorithm is applied in a setting where the guarantee is not proved.
(2) The paper lacks comparative experiments and baselines. While the authors argue somewhat convincingly that their approach should be viable, they only show qualitative results of their method. Some baselines to detect position-biased annotators should, however, be fairly easy to design (e.g. set a threshold on the ratio of left/right answers; a minimal sketch of such a baseline is given after Review #3); some experiments with LASSO rather than LBI would also be welcome to back up the claim that LBI should be better. Finally, some quantitative results showing the performance of the approach on a well-defined metric would help the comparison and the analysis of the results.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper introduces False Discovery Rate (FDR) control to detect abnormal annotators, an important problem in crowdsourced ranking. For this purpose, the paper assumes a linear model for pairwise rankings, with annotator-specific intercepts which measure the position bias of workers and are assumed to be sparse. Knockoff filters, proposed by Barber-Candes'2015, are designed for this model to achieve FDR control. The paper also introduces a new sparse regularization based on Inverse Scale Space (ISS) and its discretization, Linearized Bregman Iterations (LBI), for their advantages of debiasing and potential parallelization in large-scale implementations. A supporting theory shows that the knockoff filters indeed achieve FDR control with ISS and LBI. Numerical studies on both simulated and real-world data show the validity of the proposed FDR control in capturing abnormal annotators.

Clarity - Justification:
The paper is well-written in general with a clear logic flow. But it would be better to add more comments on the intuition behind ISS and the knockoff filter design, since both ideas are new to the machine learning community.
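For concreteness, the knockoff selection rule of Barber-Candes'2015 that the paper builds on can be sketched roughly as follows (my own paraphrase, using one standard choice of statistic and the paper's path notation, not necessarily the authors' exact formulation). For each candidate variable $j$, let $Z_j=\sup\{\lambda:\hat{\gamma}_j(\lambda)\neq 0\}$ be the time it enters the regularization path and $\tilde{Z}_j$ the entry time of its knockoff copy, and set

$$ W_j = \max(Z_j,\tilde{Z}_j)\cdot\operatorname{sign}(Z_j-\tilde{Z}_j). $$

The knockoff+ procedure at target level $q$ then selects $\hat{S}=\{j: W_j\geq\hat{T}\}$ with the data-dependent threshold

$$ \hat{T} = \min\Big\{ t>0 : \frac{1+\#\{j: W_j\leq -t\}}{\max(\#\{j: W_j\geq t\},\,1)} \leq q \Big\}. $$

The intuition is that a null variable is equally likely to enter the path before or after its knockoff copy, so the signs of the null $W_j$'s are symmetric and the ratio above conservatively estimates the proportion of false discoveries among $\{j: W_j\geq t\}$. A short paragraph of this kind in the paper would help readers unfamiliar with the knockoff construction.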
Significance - Justification:
FDR control with knockoff filters was recently developed by Barber-Candes'2015 in statistics, for linear regression in the setting of the LASSO estimator. The Inverse Scale Space (ISS) method first appeared in Burger et al. 2005 in applied mathematics for image restoration and was recently used for sparse linear regression in Osher et al. 2016, with some better properties than LASSO. To my knowledge, neither method has been studied in the machine learning community. The paper extends knockoff-based FDR control to ISS. The application to abnormal annotator detection in crowdsourced ranking is novel and convincing.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
(1) The knockoff method of Barber-Candes'2015 for linear regression typically requires that the number of samples be at least twice the number of parameters. Such a condition is implicitly assumed for the new model in the knockoff filter construction on line 355: $|E|\geq 2|A|+|V|$. It would be better to state such important constraints explicitly in the conditions of Theorem 1.
(2) In Table 3, annotator ID=94 is also a 'bad' annotator who only clicks one side, but it was missed in the color marking.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper develops an algorithm for false positive rate control in detecting a particular kind of biased annotator on crowdsourcing platforms: the type that clicks the option on one side significantly more often than the one on the other side. The authors formulate the problem as a linear model with a position fixed effect, and combine Inverse Scale Space (ISS) and Knockoff Filter (KF) methods. The resulting method is unbiased and provides an upper bound on the expected false positive rate in predicting the biased annotators. Specifically, ISS uses nonlinear differential (subgradient) equations instead of LASSO to avoid bias. The KF method constructs a "knock-off" feature matrix that has the same covariance structure as the original one. By combining the original and "knock-off" feature matrices and applying ISS with a different penalty parameter, the biased terms can be selected out. Both simulations and real data show that this algorithm can detect biased annotators and that the average false positive rate is bounded.

Clarity - Justification:
The paper is well-organized and clear, and most of it is understandable, but some parts should be explained in more detail (see detailed comments).

Significance - Justification:
This paper potentially makes a nice incremental contribution to the literature on the quality of crowdsourcing workers, but there is one big piece missing that the authors need to be clearer about. Is there prior evidence or a conjecture that workers exhibit this kind of left vs. right bias? In that case, this is a reasonable contribution that addresses a known problem (assuming others have not tried to do so). If not, it is possible that the experiments make an additional contribution by demonstrating the existence and extent of this bias, and that could be a bigger contribution if the authors were able to show it across more datasets. In terms of algorithm development, as far as I can tell this seems like a very reasonable algorithm that builds on prior work in order to accomplish the task the authors set out to solve for their application.
It seems conceivable that this model could be applied as part of a pre-processing step in different crowdsourcing settings, and perhaps generalized to find forms of systematic bias other than left vs. right as well.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
(1) The KF part of the algorithm should be explained in a little more detail; it is quite hard to follow.
(2) Line 370: $Z_j=\sup\{\lambda:\hat{\gamma}(\lambda)\neq 0\}$ should be $Z_j=\sup\{\lambda:\hat{\gamma}_j(\lambda)\neq 0\}$. There is a similar issue in Line 371.
(3) In the simulation experiment, the false positive bound parameter $q$ was changed from 10% to 20%. The number of true discoveries did not change with $q$; only the false positive rate changed. What is the relationship of $q$ with the number of true discoveries in the experiments? What if $q$ is set at a very low level, e.g. 3%?
(4) Could you provide more details on the "position effect" referred to in line 511?

=====
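As a concrete illustration of the baseline suggested in Review #1 (thresholding the left/right answer ratio), here is a minimal sketch. The input format, the cutoff of 0.65, and the minimum count of 20 are illustrative assumptions rather than anything taken from the paper, and unlike the knockoff-based procedure this heuristic comes with no FDR guarantee.

# Minimal sketch of the left/right-ratio baseline suggested in Review #1.
# Assumed (hypothetical) input: a list of (annotator_id, choice) tuples,
# where choice is "left" or "right"; cutoff=0.65 and min_count=20 are
# arbitrary illustrative values, not parameters from the paper.

from collections import Counter, defaultdict

def flag_position_biased(comparisons, cutoff=0.65, min_count=20):
    """Flag annotators whose left-click fraction deviates strongly from 1/2."""
    counts = defaultdict(Counter)
    for annotator, choice in comparisons:
        counts[annotator][choice] += 1

    flagged = {}
    for annotator, c in counts.items():
        n = c["left"] + c["right"]
        if n < min_count:                 # too few answers to judge reliably
            continue
        frac_left = c["left"] / n
        if frac_left >= cutoff or frac_left <= 1.0 - cutoff:
            flagged[annotator] = frac_left
    return flagged

# Toy usage: annotator "a2" always clicks left and should be flagged.
if __name__ == "__main__":
    data = [("a1", "left"), ("a1", "right")] * 15 + [("a2", "left")] * 30
    print(flag_position_biased(data))     # -> {'a2': 1.0}

Comparing the proposed knockoff-based method against this kind of simple heuristic, under a well-defined metric such as precision/recall of detected annotators, would make the empirical claims easier to assess.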