We thank all the reviewers for their time and effort in reading the paper and providing valuable feedback. In what follows, we address specific questions.

--------------- Assigned_Reviewer_2

(1 & 2) To clarify, the claim that the proposed mechanisms can mitigate silly errors rests on two pillars: their theoretical guarantee of incentive compatibility, which incentivizes workers to fix their mistakes once they realize them, and prior empirical evidence in psychology showing that simple reminders to check for possible errors can significantly increase the chances that people realize their own mistakes.

Section 5 investigates the tradeoff between having a second stage (which requires additional effort from each worker) and having a single stage with additional workers. This section assumes that mechanisms for both the one-stage and two-stage settings are already in place. We do not know the values of (p,q) in practice (or how they vary across applications and populations); consequently, we run simulations over a very wide range of values of these parameters and observe that the two-stage setting consistently yields superior label quality compared to the one-stage setting. Acting on your excellent suggestion, we have now theoretically established a relation between (p,q) and the number of required workers, and we will include this analysis in the revision (these analytical findings closely match the simulation results). To summarize, we posit the contributions of the paper as: proposing a new concept for crowdsourcing platforms based on prior empirical evidence from psychology, deriving mechanisms with rigorous theoretical guarantees, and providing extensive numerical simulations that compare the final accuracy of labels obtained from the proposed two-stage setup with that from the standard one-stage setting.

(3) One could compare a worker's answers with the gold standard, but we consider a separate reference since our goal is not to infer the correct answers to the gold standard questions (which we already know) from the workers. The theory continues to hold if the gold standard questions are used as the reference when G < N, since, as you point out, the worker does not know which questions are gold standard.

(4) In the range near 1/2, the worker may not be indifferent and may be incentivized to take one action or the other. The impossibility result in the paper shows that no mechanism can control the incentives of the worker within this range near 1/2.

(5) Yes, that is what we mean. The proposed mechanism is a scoring rule for the proposed two-stage setup.

(7) The only constraint is that G >= 1.

--------- Assigned_Reviewer_3

We thank the reviewer for the encouraging feedback.

--------- Assigned_Reviewer_4

- The results and incentive compatibility hold irrespective of the choice of reference. We will add clarifying details to the revision, as per your suggestion.

- The bias in the second stage results from the intuitive and well-studied fact in psychology that it is easy for a worker to copy the reference, and consequently workers are biased towards this easy action (see Muchnik et al. 2013; Clayton 1997; Rowe & Wright 1999; Hasson et al. 2000).

- As you mention, we evaluate a worker's performance only on the gold standard questions. We emphasize that this approach is standard in practice: all mechanisms (that use gold standard questions) on all crowdsourcing platforms evaluate workers only on the gold standard questions, since only the answers to these questions can be verified with certainty in order to make the payment.
- Bias removal techniques are orthogonal to our work, and our work can complement them nicely. In more detail, our work incentivizes workers to respond in a certain manner **during** the data collection phase of crowdsourcing, while bias removal techniques are used to **post-process** the obtained data. Indeed, the post-processing techniques of Wauthier et al. and many others can be applied to the data obtained from the proposed two-stage setup in order to draw informative inferences. Papers on bias removal using statistical inference and machine learning techniques (many papers in recent ICML/NIPS/KDD/WWW/HCOMP, including Wauthier et al.) all show that while bias can be mitigated, an intrinsically **lower bias leads to significantly superior results**. Our setting and mechanism help in achieving exactly this goal of a lower bias.

- We run simulations over a very wide range of behavioral parameters in order to evaluate the effects across a variety of populations, and we see that our two-stage setting consistently yields superior label quality compared to the one-stage setting (a toy sketch of this comparison appears below). Indeed, we posit our paper as one that proposes a new concept for crowdsourcing platforms based on prior empirical evidence from psychology, derives mechanisms with rigorous theoretical guarantees, and provides extensive numerical simulations comparing the final accuracy of labels obtained from our proposed setup with that from the standard one-stage setting.
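To make the one-stage versus two-stage comparison concrete, the following is a minimal, hypothetical sketch and not the simulation code from the paper. Purely for illustration, it assumes binary labels, takes p to be the probability that a worker's first-stage answer is correct and q the probability that a worker who erred self-corrects in the second stage, and compares majority-vote accuracy of five one-stage workers against three two-stage workers; the function name, parameter interpretations, and worker counts are our own illustrative assumptions.

```python
import random

def simulate_accuracy(p, q, n_workers, two_stage, n_items=10000, seed=0):
    """Toy majority-vote accuracy under a hypothetical (p, q) worker model."""
    rng = random.Random(seed)
    correct_items = 0
    for _ in range(n_items):
        votes_correct = 0
        for _ in range(n_workers):
            # first stage: the worker answers correctly with probability p
            is_correct = rng.random() < p
            if two_stage and not is_correct:
                # second stage: a mistaken worker self-corrects with probability q
                is_correct = rng.random() < q
            votes_correct += is_correct
        # aggregate the workers' final answers by majority vote (ties at random)
        if votes_correct * 2 > n_workers or (
            votes_correct * 2 == n_workers and rng.random() < 0.5
        ):
            correct_items += 1
    return correct_items / n_items

if __name__ == "__main__":
    # sweep a wide range of behavioral parameters, as described in the rebuttal
    for p in (0.6, 0.7, 0.8):
        for q in (0.3, 0.5, 0.7):
            one = simulate_accuracy(p, q, n_workers=5, two_stage=False)
            two = simulate_accuracy(p, q, n_workers=3, two_stage=True)
            print(f"p={p:.1f} q={q:.1f}  one-stage(5 workers)={one:.3f}  "
                  f"two-stage(3 workers)={two:.3f}")
```

Under this toy model, the second stage raises each worker's effective accuracy from p to p + (1-p)q, which is why fewer two-stage workers can match or exceed a larger pool of one-stage workers over a wide range of (p, q).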