We thank all the reviewers for their time in reviewing this paper and for their constructive comments. We would like to make the following clarifications.

To Reviewer 1: We fully agree with your suggestion: the condition you mentioned will be added to Theorem 1, and the bad annotator ID=94 will be marked in red.

To Reviewer 2: First, as pointed out by H. A. David in his famous book "The Method of Paired Comparisons", the Gaussian assumption approximately holds in the uniform model when multiple comparisons exist for each pair of candidates. It is arguable which model is better (uniform, Thurstone-Mosteller, Bradley-Terry-Luce, etc., which differ only in the link function); in practice, researchers have found that no single model dominates in all cases and the uniform model often works well. Second, our model only adopts a bounded-variance noise assumption (lines 214-216), as required by the Gauss-Markov theorem; the Gaussian assumption is used only for the theoretical justification of the knockoff method for FDR control, which remains quite open for general linear models. Our paper provides further realistic settings in which the knockoff method indeed works.

Setting a threshold on the ratio of left/right answers is misleading, as position-biased annotators may well exhibit a mixture of rational behavior (following the uniform model) and careless voting (e.g., clicking one side when getting tired); see Table 3. A large ratio may be due to either a large preference-score difference or abnormal behavior, and a simple threshold cannot distinguish the two even if one optimizes it. In contrast, knockoff-based FDR control does not need any optimal tuning of parameters, at the cost of allowing a certain proportion of false discoveries.

A systematic comparison between LASSO and LBI is given in Osher et al. 2016, where ISS/LBI can be more accurate than LASSO in terms of bias reduction. For FDR control, as discussed in Tables 1 and 2, LBI exhibits good performance in simulations (competitive with LASSO in terms of mean F1-scores) while having a simpler implementation (the 3-line algorithm in Sec. 2.2; a minimal sketch is included below) than LASSO. For the real-data studies, LASSO and ISS identify exactly the same set of abnormal annotators at q=10%, which shows the stability of FDR control in these examples. Due to the page limit we omitted some tables in this version, but we can add them back.

To Reviewer 3: There is indeed historical evidence that workers exhibit this kind of left-vs-right bias. For example, in [Day, R. L. Position bias in paired product tests. Journal of Marketing Research, 1969, 6(1): 98-100], the author wrote: "A phenomenon that has annoyed researchers who have used paired comparison tests is position bias or testing order effects." Besides the four datasets in our paper, we are testing more (e.g., food taste) and find that the phenomenon is very likely ubiquitous in modern uncontrolled crowdsourcing scenarios. To our surprise, however, there have been few systematic and quantitative studies of this problem in the modern literature. Here we hope to initiate a statistical framework that exploits False Discovery Rate (FDR) control for the quantitative detection of position bias, which may only touch the tip of the iceberg. We are exploring other types of annotator-based variation in real data, as the reviewer points out. We will add more details on the knockoff (KF) part of the algorithm. Besides, we will carefully fix all the typos and improve the grammar in the final version.
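For concreteness, here is a minimal sketch of the Linearized Bregman Iteration (the 3-line algorithm of Sec. 2.2) in the standard ISS/LBI form of Osher et al. 2016. The variable names, step-size heuristic, and default values are illustrative only and do not reproduce our exact implementation.

```python
import numpy as np

def lbi_path(X, y, kappa=10.0, alpha=None, n_steps=500):
    """Linearized Bregman Iteration (ISS/LBI) for a sparse linear model.

    Returns the regularization path of coefficient estimates.
    kappa, alpha, and n_steps are illustrative defaults, not the paper's settings.
    """
    n, p = X.shape
    if alpha is None:
        # heuristic step size satisfying the stability condition
        # alpha * kappa * ||X||_2^2 / n <= 1
        alpha = n / (kappa * np.linalg.norm(X, 2) ** 2)
    z = np.zeros(p)
    beta = np.zeros(p)
    path = []
    for _ in range(n_steps):
        # gradient step on the residual
        z += (alpha / n) * X.T @ (y - X @ beta)
        # soft-thresholding (shrinkage) scaled by kappa
        beta = kappa * np.sign(z) * np.maximum(np.abs(z) - 1.0, 0.0)
        path.append(beta.copy())
    return np.array(path)
```

In our framework, a data-driven stopping rule on such a path (via the knockoff construction) is what controls the FDR at the target level q; the sketch above only illustrates the path computation itself.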
There are 50 bad/ugly annotators in the simulated data. That the number of true discoveries is close to 50 means we picked out almost all of these annotators. A larger $q$ means a more aggressive selection, i.e., more annotators are picked out; but since almost all of the bad/ugly annotators are already picked out, the improvement in true discoveries is relatively small. A larger $p_2$ means these bad/ugly annotators are more affected by position bias, so they are usually easier to detect; accordingly, as seen in Table 2, the number of true discoveries increases slightly as $p_2$ increases. The reversal probability $p_1$, however, does not affect the number of true discoveries, which shows that position-bias detection with FDR control is quite robust to noise. Even if $q$ is set at a very low level, e.g., 3%, both LASSO and ISS can still provide an accurate detection of position-biased annotators (indicated by an actual FDR controlled around 3% and a number of true discoveries around 50). The position effect here simply means position bias, e.g., "with probability $p_2$ they always click one side."
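To make this generative description concrete, below is a minimal sketch of how such annotators could be simulated; the function and variable names are illustrative and the exact simulation code in the paper may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_annotator(score_diff, p1, p2, biased):
    """Simulate one annotator's left/right choices on paired comparisons.

    score_diff: array of true preference-score differences (left minus right).
    p1: reversal (noise) probability applied to a rational vote.
    p2: probability that a biased (bad/ugly) annotator simply clicks one fixed side.
    biased: whether this annotator suffers from position bias.
    """
    n_pairs = len(score_diff)
    choices = np.empty(n_pairs, dtype=int)  # 1 = left, 0 = right
    for i in range(n_pairs):
        if biased and rng.random() < p2:
            choices[i] = 1                   # position bias: always click one side
        else:
            vote = 1 if score_diff[i] > 0 else 0
            if rng.random() < p1:            # reversal noise flips the rational vote
                vote = 1 - vote
            choices[i] = vote
    return choices
```

Under this kind of model, a larger $p_2$ makes the one-sided clicking more frequent and hence easier to detect, while $p_1$ only perturbs the rational votes, which is consistent with the behavior reported in Table 2.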