First of all, we would like to thank all reviewers for their very informative comments and suggestions. We will first respond to the comment of reviewer 4, who recommended rejecting our paper for what we consider an unfair reason, and we will then respond to the comments of reviewers 2 and 3.

Regarding the comment of reviewer 4: developing a model that describes the generative process through which the values of the variables we wish to infer, along with the observed variables, were generated lies at the core of probabilistic graphical models. Thus, we do not consider that aspect of our writing misleading. A good reference describing the process through which such a model can be designed is the Latent Dirichlet Allocation paper [Blei et al., 2003]. If our writing is indeed confusing, we would be happy to devote some space to providing more intuition.

Reviewers 2 and 3 both commented that the DP extensions to our BEE model are confusing and not that helpful in terms of empirical performance. We agree, and we would be happy to move most of the content related to those extensions to the supplementary material and devote more space in the paper to the following items, but we ask that the reviewers carefully reconsider the new contributions in light of these changes:

- A more detailed analysis of related work and of the practical significance of our method. We will also provide more details on current state-of-the-art methods.

- A more detailed description of BEE, the assumptions it makes, and its strengths and weaknesses. We will also add some comments on how one can make the error rates dependent on the input data features (as per reviewer 2's comments) and on why we intentionally chose not to do so: the main reason is so that our method can be used as a "black-box" approach in a wide variety of settings, such as crowdsourcing, without the need to modify the model to use the particular features available for arbitrary inputs. Furthermore, we will move section 3.4 to future work and provide more details on how the full confusion matrix can be modeled using a model very similar to BEE. As per reviewer 2's comments, we will also describe a setting in which BEE can fail. One such setting is when the outputs of all classifiers are highly dependent on each other (in the worst case, all classifiers always output the same answer), and we can provide a short description of why the models fail in that case; the second code sketch following this list illustrates the problem.

- A more detailed description of what we mean by "missing data" in section 3.1, as per the comments of reviewers 2 and 3. It is true that those variables can be marginalized out quite easily, and we plan to add that to the paper; in fact, marginalizing them out amounts to simply ignoring the corresponding values in the Gibbs sampling equations, as the first code sketch following this list shows. Note that the practical significance of the missing data case lies in the fact that, when multiple classifiers offer predictions for many labels, some of those classifiers might "abstain" from providing a prediction (due to not having features for the input data instance, for example; when the inputs are noun phrases, as is the case with NELL, this is a frequent setting).

- Reviewer 2 also mentions the identifiability issue. This is briefly discussed in the bottom-left part of page 7 of our paper, in the justification for the error rate prior that we use. We will also add a couple more sentences in the methods section discussing this issue, as it is both interesting and important.
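To make the marginalization point concrete, here is a minimal sketch of a Gibbs sampler for a simplified one-coin version of the model (binary labels, one error rate per classifier). This is illustrative code rather than the exact sampler from our paper, and the function name, hyperparameters, and priors are assumptions made purely for the example. The point is that missing predictions (encoded as np.nan) simply contribute no terms to either conditional update:

```python
import numpy as np

def gibbs_bee_sketch(F, num_iters=1000, a_e=1.0, b_e=10.0, p=0.5, seed=0):
    """F: (N, M) array with predictions in {0, 1} and np.nan for abstentions."""
    rng = np.random.default_rng(seed)
    N, M = F.shape
    observed = ~np.isnan(F)                      # mask of non-missing predictions
    labels = rng.integers(0, 2, size=N)          # initialize latent binary labels
    errors = rng.beta(a_e, b_e, size=M)          # initialize classifier error rates
    error_samples = []
    for _ in range(num_iters):
        # Error rate update: the Beta posterior counts only the predictions
        # that classifier j actually made; missing entries are just skipped.
        for j in range(M):
            obs = observed[:, j]
            mistakes = np.sum(F[obs, j] != labels[obs])
            correct = np.count_nonzero(obs) - mistakes
            errors[j] = rng.beta(a_e + mistakes, b_e + correct)
        # Label update: each observed prediction contributes one likelihood
        # factor to the posterior log-odds; missing ones contribute nothing.
        for i in range(N):
            log_odds = np.log(p) - np.log1p(-p)  # prior log-odds for label 1
            for j in np.flatnonzero(observed[i]):
                if F[i, j] == 1:
                    log_odds += np.log1p(-errors[j]) - np.log(errors[j])
                else:
                    log_odds += np.log(errors[j]) - np.log1p(-errors[j])
            labels[i] = rng.random() < 1.0 / (1.0 + np.exp(-log_odds))
        error_samples.append(errors.copy())
    # Discard the first half as burn-in and return posterior mean error rates.
    return np.mean(error_samples[num_iters // 2:], axis=0)
```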
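The following toy computation, under the same simplified one-coin model (with our own illustrative numbers, not results from the paper), shows why fully dependent classifier outputs break the model: when all classifiers always agree, the likelihood is invariant under jointly flipping every label and every error rate, so the data alone cannot distinguish "everyone is always correct" from "everyone is always wrong", and only the error rate prior breaks the tie. This is the same phenomenon as the identifiability issue mentioned in the last item above:

```python
import numpy as np

def log_likelihood(F, labels, errors):
    """Log P(F | labels, errors) for the one-coin model (no missing entries)."""
    mistakes = F != labels[:, None]              # (N, M) indicators of errors
    return float(np.sum(mistakes * np.log(errors) + ~mistakes * np.log1p(-errors)))

rng = np.random.default_rng(0)
N, M = 50, 5
# Worst-case dependence: every classifier outputs exactly the same answers.
F = np.tile(rng.integers(0, 2, size=(N, 1)), (1, M))

labels, errors = F[:, 0].copy(), np.full(M, 0.1)
print(log_likelihood(F, labels, errors))          # "everyone always correct"
print(log_likelihood(F, 1 - labels, 1 - errors))  # "everyone always wrong": same value
```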
Reviewer 2 also asks for a more detailed analysis of our experimental results. We omitted this analysis due to space constraints. We have all of the suggested results available (beyond the summary statistics provided in the paper), and we originally planned to publish them on a supporting website for the paper, which will also include the code and data needed to reproduce the results. We can add them as supplementary material if the reviewers and editors still think that this would be helpful. The short answer to reviewer 2's question on whether our methods provide systematic improvements across almost all classifiers is: yes, they do.