Paper ID: 1
Title: No Oops, You Won't Do It Again: Mechanisms for Self-correction in Crowdsourcing

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper considers the problem of correcting answers given by crowdsourcing workers. Concretely, it investigates mechanisms of the following kind: after the worker has given her answers, she gets to see 'reference answers', obtained either from the answers of other workers or from a gold standard. After being exposed to the reference answers, she can change her mind. The paper investigates incentive compatible mechanisms that incentivize workers to act according to their true beliefs in both stages. The main results are:
1) Ensuring incentive compatibility in the first stage can be done by a simple mechanism.
2) No incentive compatible mechanism exists for the two stages.
3) If the requirement to tell the truth in the first stage is relaxed to "tell the truth only if you are confident enough", there are infinitely many mechanisms.
4) By adding a "no free lunch" axiom, which dictates no payment if the worker answers arbitrarily and then fixes her answers according to the reference, there is a unique incentive compatible mechanism that minimizes the confidence requirement.
5) The authors empirically show that using this mechanism improves the results of learning algorithms.

Clarity - Justification:
The paper is well written.

Significance - Justification:
The paper takes an interesting problem and gives an elegant and original solution.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
A great paper! See significance section.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes a two-step mechanism that allows workers to "self-correct", either based on their own reasoning or against a provided reference. Theoretical results are provided for binary choice tasks, with "gold" standard questions used as a measure of worker task completion performance. An incentive compatible mechanism is proposed through the use of a slack variable. Performance is evaluated through the classification of simulated crowdsourced labels using standard machine learning algorithms.

Clarity - Justification:
As it is central to the mechanism design, more detail should be provided on the choice of reference and its impact on the mechanism. More detail should also be provided on how the second stage introduces bias. It would be useful to state that some tasks may not be compatible with a two-stage approach. Without having read the appendix, the binary task seems to heavily increase the bias toward the reference in the second stage; it would be useful to have a more detailed explanation or intuition on how the mechanism is affected by an increased number of choices.

Significance - Justification:
P. G. Ipeirotis, F. Provost, and J. Wang, "Quality Management on Amazon Mechanical Turk," Proceedings of the ACM SIGKDD Workshop on Human Computation (HCOMP), pages 64–67, Washington DC, 2010, show that biased labelers who systematically mislabel tasks are valuable. If G is less than N and the labels are all incorrect or all correct, but the remaining questions carry some useful information, your no-free-lunch mechanism would provide no reward. You note that a reward is still possible, but your setting seems to ignore the bias removal methods proposed in the literature, such as Wauthier, Fabian L., and Michael I. Jordan, "Bayesian bias mitigation for crowdsourcing," Advances in Neural Information Processing Systems (NIPS), 2011.
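To make the no-free-lunch concern above concrete, here is a minimal sketch of my reading of the axiom (this is not the paper's Algorithm 1; the function and variable names are hypothetical):

# Illustrative reading of the no-free-lunch axiom (reviewer's paraphrase,
# not the paper's actual payment rule); all names are hypothetical.
def no_free_lunch_pays_zero(gold_first_stage_correct, gold_copied_reference):
    """Zero payment when every gold answer was wrong in the first stage and
    the worker merely copied the reference answer in the second stage."""
    return (not any(gold_first_stage_correct)) and all(gold_copied_reference)

# Example with G = 3 gold questions out of N = 10: the worker earns nothing,
# even if her remaining N - G = 7 (non-gold) answers carry useful signal.
print(no_free_lunch_pays_zero([False, False, False], [True, True, True]))  # True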
"Bayesian bias mitigation for crowdsourcing." Advances in Neural Information Processing Systems. 2011. The experiments did not provide a realistic representation of crowdsourced task completion performance. It seems reasonable to expect more realistic empirical results. The existing experiments were not convincing. Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): Overall, I enjoyed reading the paper. I would have liked to see more of the theory in the paper and not in the appendix. I understand there are space restrictions, but the content seems like it would benefit quite a lot by including some of the results. You should note that the proofs are in the appendix. You should try to include some portion of the proofs the make the results more clear. l45: seems like there's a missing word "labeled crowds" l47: rewards seems like a clear replacement for "payments" l72: "much emphasis" seems like it could be improved for clarity l73: "implies requirement" is not clear Figure 1: would be nice if the images were larger. It's hard to see. Additionally, this image might be useful in a discussion of how multiple tasks might impact the result. l131: "intimated of" is not clear ===== Review #3 ===== Summary of the paper (Summarize the main claims/contributions of the paper.): In this paper, the authors propose an improved way of crowdsourcing data labels that are to be used in supervised machine learning (in particular, binary classification): Each worker is administered a set of two-step queries; in the first step, she picks one of two labels for a given data instance, and all her responses are compared to a ‘reference answer’ generated from her peers’ responses; in the second, she is shown the reference answer for each query on which she had a mismatch with the reference answer, and is given a chance to change her answer (self-correction). Her compensation is based on her two-step responses to only a number (known to her) of gold standard queries dispersed uniformly at random through out her task set. In the first part of the paper, the authors design a mechanism, described in Algorithm 1, for this problem that satisfies ‘incentive compatibility with margins’ (defined in Section 4.2) and a ‘no-free-lunch axiom’ (Definition 1): Informally, it incentivizes the worker to choose that option in the first stage for which her (prior) personal probability exceeds (1/2+’slack’), and that option in the second stage for which her posterior probability after observing the reference answer (different from her own) exceeds some T > 1/2, the relation between ‘slack’ and T being given by equation (2), lines 563-568 (this T and the maximum pay per worker ‘mu’ are input parameters). In the second part, they report numerical simulations which show that, for any two-stage mechanism that statistically improves the quality of the crowdsourced training labels relative to a single-stage mechanism, a higher test-set accuracy can be obtained after training an SVM (with either linear or RBF kernel) on those labels with a smaller number of workers than for the single-stage mechanism. Clarity - Justification: The paper is well written for the most part. 
Clarity - Justification:
The paper is well written for the most part. I like the way the authors walk the reader through their thought process leading to the final algorithm, explaining the rationale for all the assumptions along the way: the trivial incentive compatible mechanism for the single-stage setting with no self-correction (Proposition 1), the impossibility of an incentive compatible mechanism with zero slack for T > 1/2 (Theorem 1), the infinitely many incentive compatible mechanisms with slack in the absence of further assumptions (Theorem 2), the two impossibility results, for any smaller value of the slack (Theorem 3) and for a stronger form of their no-free-lunch axiom (Theorem 6), and the main existence and uniqueness results (Theorems 4 and 5). However, I do have some questions and criticisms, which I have merged with my other 'detailed comments' below.

Significance - Justification:
Crowdsourcing data labels for training is an emerging practice, and this paper appears to have taken the first step in designing a mechanism for this purpose that has some theoretical underpinnings and is prima facie practicable. But see my comment (1) below.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
(1) My main complaint is that there seems to be no theoretical guarantee or empirical evidence (see (2) below) that this two-step mechanism will indeed improve final test-set accuracy, or even training label quality, relative to a simple one-step approach with a trivial incentive compatible mechanism (although that is what one would expect, intuitively). It is not clear to me that the mechanism proposed for (fully or partially) rational workers, lines 310-320, actually 'mitigate[s] the many errors that occur due to silly mistakes or inadvertent errors' (lines 19-21). Or did I miss something?

(2) Section 5 seems to be mostly unrelated to the rest of the paper, and the authors have not been clear about this. The mechanism proposed in Section 4 is not used for the simulations at all; the authors abstract away from all implementation details by simulating a worker population with error probability p in the first stage, reduced by q in the second stage. They then belabor the rather unsurprising observation that the final test-set accuracy is worse, even with a higher number of workers, if we drop the self-correction stage, for various values of the parameters p and q (which seem to be chosen arbitrarily, with no connection to, say, the parameters of the proposed mechanism, although I must admit that a wide range of values of p and q is covered in Figures 3 and 4). The finding that even 9 one-step workers are outperformed by 5 workers with self-correction is somewhat interesting, but what might have made this section a bit more impressive is a (perhaps empirical) relation between the parameter values (p, q) and the minimum number of one-step workers required to beat a given number of workers with self-correction. Have the authors made any attempts in this vein? (A toy sketch of the kind of comparison I have in mind appears at the end of these comments.)

(3) It appears from the presentation that, even for the gold standard questions, the workers' first-step responses are compared with reference answers generated by the algorithm (from the crowd's responses or a noisy ML algorithm, as indicated in lines 147-151) instead of the known correct answers to those questions; otherwise the outcomes '-M', '+R', and '-C' would be impossible. Wouldn't it make more sense to use the correct answers for comparison on these questions? Would there be any impossibility or triviality result for this setting, or do this paper's results still hold in some way? The workers don't know which questions are gold standard, and their belief structure (Section 2.2) seems to abstract away from any assumption about how the 'reference answers' are generated anyway.

(4) Although the paper says nothing about how to generate a usable 'consensus' label from the workers' potentially diverse responses to the same query, I like the idea of a mechanism that results in a second-stage belief threshold strictly greater than 50%: it would prevent an incorrect consensus due to beliefs being biased by the reference answer. However, under the proposed mechanism, a worker's action for a prior belief in the range [1/2 - slack, 1/2 + slack] is left unspecified: is such a worker indifferent between the two choices?

(5) Related literature: It is not clear exactly how this payment mechanism "falls into the broad framework of strictly proper scoring rules" (lines 244-247), since the authors don't seem to have borrowed from or contributed to the classical scoring rule literature. Do they just mean that their payment rule maps a worker's action and the state of the world (i.e., the correct answer), encoded in terms of +M, -M, …, -C, into real numbers, satisfying a _relaxed_ form of incentive compatibility?

(6) Presentation: I think it would be better if, instead of listing the first and second stage requirements with bullet points (lines 343-363), they were enumerated and then referenced by these numbers in lines 423-424.

(7) A clarification question: Are there any restrictions on the choice of G, the number of gold-standard queries?
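Returning to comment (2), the following is a toy sketch of the kind of comparison I have in mind; the dataset, the majority-vote aggregation, and all parameter values are my own assumptions, not the authors' protocol:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def crowd_labels(y_true, n_workers, p, q, self_correction):
    """Workers flip each label with probability p; self-correction lowers the
    error probability to p - q. Labels are aggregated by majority vote."""
    err = max(p - q, 0.0) if self_correction else p
    votes = np.array([np.where(rng.random(len(y_true)) < err, 1 - y_true, y_true)
                      for _ in range(n_workers)])
    return (votes.mean(axis=0) > 0.5).astype(int)

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Compare 9 one-step workers against 5 workers with self-correction; sweeping
# over (p, q) and the worker counts would give the empirical relation I ask about.
for n_workers, corrected in [(9, False), (5, True)]:
    noisy = crowd_labels(y_tr, n_workers, p=0.35, q=0.20, self_correction=corrected)
    acc = SVC(kernel="rbf").fit(X_tr, noisy).score(X_te, y_te)
    print(n_workers, "workers,", "self-correction" if corrected else "one-step", "->", round(acc, 3))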
=====