Paper ID: 691
Title: Estimating Accuracy from Unlabeled Data: A Bayesian Approach

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper describes a simple Bayesian model for estimating both the error rates of multiple individual classifiers and the true labels, for data where no true labels are available. The simple model is then extended to allow for sharing of information across domains and clustering of classifiers using non-parametric Bayesian extensions of the simple model. Experiments on two different domains (text and brain imaging data) demonstrate that the proposed approach is more accurate than existing methods - the experiments also show that the non-parametric extensions do not yield much improvement in accuracy over the relatively simple model.

Clarity - Justification:
(I really wanted to use the label "average" here, but that wasn't an option.) The paper is generally clear, but there is room for improvement. For example, at the end of Section 3.1, "missing data" is emphasized, but there is no clear definition of what precisely is missing. The paper mentions that the missing data can be treated as a latent variable during Gibbs sampling, but no details are provided - this paragraph is confusing. The introduction and discussion of related work are a bit terse (partly, of course, because of the space limitations). I would recommend rewriting this and trying to provide a bit more context and explanation to a reader who is reading this 10 or 15 years from now. Section 3.4 should be in future work, not where it is currently. The DP description could be a bit more precise; e.g., define clearly what you mean by "atoms" in this context.

Significance - Justification:
I like the overall direction of the paper, and the results from the simple model (BEE) are surprisingly good. However, I came away from the paper wishing that it had spent more time exploring the strengths and weaknesses of BEE in particular, and less time on developing the non-parametric framework (which ended up not helping much empirically). The key message seems to be that the simple BEE model does a good job on the 2 domains that were explored here. This is interesting and worth exploring in more depth. For example, I would like to see a more detailed discussion of the assumptions that are implicit in the graphical model in Figure 1. We could, for example, make the e_j's dependent on either X_i (in which case there would be e_{ij}'s, with e_j perhaps as a prior) and/or dependent on p itself (if p is very close to 1 we might expect the e_j's to be low - perhaps). I don't know whether either of these types of dependencies is even remotely useful, but the point is that as a reader I would like to see more discussion of the strengths (simplicity, ...) and limitations (possibly useful dependencies not being modeled, ...) of the model in Figure 1. It would also be good to see some discussion of identifiability here if that could be done - one worries that for some of these problems the e_j's and l_i's might not be identifiable. Perhaps a starting point would be to take a devil's advocate position and create a synthetic example where the model proposed in the paper would fail - is there such an example? The idea here is to provide the reader with a bit more of an in-depth discussion of the strengths and weaknesses of the model presented in Figure 1 (and its cousins in later figures).
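To make the devil's-advocate suggestion concrete, here is one way such a synthetic example could be set up (this is my own illustrative sketch, not code from the paper; the variable names and the shared-failure-mode mechanism are assumptions on my part): give a subset of classifiers a common failure mode so that their errors are strongly correlated, and check whether their near-perfect mutual agreement misleads an estimator that assumes errors are independent given the true label.

    # Hypothetical synthetic check (not from the paper): classifiers 0-2 make
    # independent errors, while classifiers 3-4 share a common failure mode, so
    # a model assuming errors are independent given the label may overrate them.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 20000
    labels = rng.integers(0, 2, size=n)            # true binary labels l_i

    def independent_classifier(err):
        flips = rng.random(n) < err
        return np.where(flips, 1 - labels, labels)

    shared_flips = rng.random(n) < 0.30            # common failure mode
    outputs = [independent_classifier(e) for e in (0.10, 0.15, 0.20)]
    outputs += [np.where(shared_flips, 1 - labels, labels) for _ in range(2)]
    outputs = np.array(outputs)

    true_err = (outputs != labels).mean(axis=1)    # what we want to recover
    agree = np.array([[(outputs[i] == outputs[j]).mean() for j in range(5)]
                      for i in range(5)])
    print("true error rates:", np.round(true_err, 3))
    print("pairwise agreement:\n", np.round(agree, 3))
    # Classifiers 3 and 4 agree perfectly with each other despite a ~30% error
    # rate - exactly the kind of case where the e_j's may not be identifiable
    # from agreement patterns alone.

Even without re-implementing BEE, running the model on data generated this way would show whether (and how gracefully) it degrades when the conditional-independence assumption behind Figure 1 is violated.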
In general I found the DP part of the paper to be a bit of a distraction from the main agenda - it seems "bolted on" to a certain extent. It wasn't entirely clear to me how much benefit (if any) it was providing - the error rates are reported on a log scale in 3 of the 4 plots, so in cases where DP does improve over BEE it seems like the difference might be small in an absolute sense.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
In addition to the earlier comments above, as a reader I would like to have seen the experimental results broken out in different ways. In the paper we just get to see summary statistics of error rates (MSE and MAD). It would be informative to see some other ways to look at this data in more detail, such as getting some information about what the actual true error rates are for the classifiers (e.g., via boxplots) - this would help provide some useful additional context for the experiments. It would also be helpful to look at how correlated the estimates are for the proposed methods versus, say, the cBCC method, on a classifier-by-classifier basis (e.g., via a scatter plot) - are the proposed methods providing systematic improvements across all (or almost all) classifiers, or are they doing much better on a subset of classifiers and about the same on others? And does this analysis provide any insight into how the methods differ? These are just some suggestions - it seems like there are a lot of other aspects of the experiments that could be investigated more deeply, with the extra needed space coming from perhaps putting less attention on the DP methods, or providing supplementary material, etc.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper considers the problem of estimating the accuracy of a collection of classifiers without labeled data (or a "gold standard" classifier). It develops a simple model based on a shared error rate prior, extends it to the multi-output case with a Dirichlet process prior on clusterings of outputs, and further extends it with an additional Dirichlet process layer to clusterings of classifiers. Experiments are provided comparing the proposed methods to past work on natural language and brain fMRI scan datasets.

Clarity - Justification:
I don't have very much to say here -- the paper was generally well-written and easy to follow. My only concern is that the sections describing the DP and HDP extensions of the basic model are somewhat unclear. For example, the graphical model for BEE is helpful, but beyond that the graphical models just become confusing and unhelpful. In my experience, graphical models are only useful when there is a simple/interesting/obvious structure in the visual representation that would not be seen from a mathematical construction. For CBEE and HCBEE, this doesn't appear to be the case.

Significance - Justification:
I am not an expert in the field of classification, nor in the unsupervised comparison of classifiers. Whether or not the proposed models themselves are incredibly novel within the literature of these fields, I am not qualified to say. Given my particular expertise, the models *seem* very straightforward. The experiments, on the other hand, carry the paper regardless of the above sentiment. They are complete and well-written. The proposed methods are compared to a smorgasbord of state-of-the-art algorithms.
The results are strong, and justify the methods (or at the very least, the BEE method -- it might be argued that the improvements of CBEE and HCBEE don't justify the complexity of these approaches). I believe this paper is of significance based on the merits of its experimental results alone; I will defer to other reviewers on the novelty of the proposed models.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
-- p2L: "independence given labels" -- independence of what? The classifier outputs? Also, what does "assuming the true distribution of labels" mean?
-- p2L: The paper does not state any drawbacks of the Moreno (2015) and Tian & Zhu (2015) references in the related work. What are their shortcomings that are addressed in the present work?
-- p2L: The first sentence in Sec 3 is verging on a run-on.
-- p3L: In most Bayesian models, missing observations can be marginalized easily. Why is it that in your model you need to explicitly represent them as latent variables if they're missing? For that matter, why is it hard to include missing data in past work?
-- p3L: I'd remove the "implicit use of agreement rates" box; it's a little unprofessional.
-- p6L: Never thought I'd see a citation of Harry Potter in an ICML paper. Points for style!

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper aims to propose a method that uses unlabeled data for accuracy estimation of classifiers and for classifier output combination. The methods are based on a Bayesian methodology.

Clarity - Justification:
My main issue with the paper was with clarity. I could not understand the main methodology of the paper even though I tried multiple times. My understanding of the proposed Bayesian models (a methodology with which I am deeply familiar) was that the proposed methods generate the data and its labels. I could not understand how this can be used for the evaluation of the output of a classifier that has already produced labels for the data. I am happy to reconsider my score if the authors explain their high-level idea in a way that enables me to understand the paper after reading it once again. In any case, I think the paper should be substantially revised so that it is easier to understand.

Significance - Justification:
See my comment on clarity above.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
See my comment on clarity above.

===== Review #4 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper presents a natural Bayesian model to estimate accuracies (error rates) of a family of pre-trained classifiers using (only) unlabeled data. Estimating accuracies is important particularly for transfer learning and, to some extent, domain transfer. The model is extended to cluster accuracies over domains and classifiers using a non-parametric approach. Gibbs sampling is used to estimate the accuracies of the classifiers; the proposed models outperform the SoA on estimating these accuracies on two realistic datasets, as measured using MSE. The model also results in better labeling (i.e., model combination).

Clarity - Justification:
Overall the paper was quite clear, though there are some avenues for improvement:
* The introduction could be worded better; for example, lines 64-69 use an excessive number of abstract references ("that", "this").
* In line 75-78, this wasn't
* The details of the individual models are fairly clear.
* The datasets the results are reported on could be made much more apparent in Figures 4 and 5.
* Finally, there are a lot of little details that go unmentioned in the evaluation section. Please see the detailed comments below for an elaboration.

Significance - Justification:
While the problem setup has been presented before (e.g., without a Bayesian cap in Platanios et al. 2014), the paper presents a new Bayesian take on the problem of estimating accuracies. This new method is significantly better than prior work empirically.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
* How were the separate classifiers trained in the experiments? Were they trained for each domain separately, or is it a single model that is used across domains?
** In particular, it seems that if the classifiers were trained on labeled data they could work well and thus have high enough agreement that computing accuracy should be very easy. What was the range of per-classifier accuracy obtained?
** One of the chief motivating applications described in the introduction was to support fully unsupervised systems, for which there is not a lot of labeled data to train individual classifiers. How would this model help in such situations?
* How strong is the independence assumption used in the models? One of the key points of Platanios et al. 2014 seems to be that assuming classifiers are independent can significantly hurt performance. How is your model able to avoid this pitfall? (A small illustration of how much work the independence assumption does is sketched after this review.)
* A (mostly valid) selling proposition of this paper is to help in transfer learning. I think this is an extremely important application, but I am not fully convinced that the datasets showcase transfer learning. How (dis)similar are the "domains" as defined in the experiments?
** I'd be particularly impressed with an evaluation on a sentiment dataset, e.g., Wachsmuth, H., Kiesel, J., & Stein, B. (2015). Sentiment Flow – A General Model of Web Review Argumentation. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 601–611. Retrieved from http://www.uni-weimar.de/medien/webis/publications/papers/stein_2015n.pdf
* What are the failure modes for the model? I can foresee two issues: the independence assumption (since Bayesian models typically fail when the model is misspecified) and low accuracy of the individual classifiers.
* What is the actual standard deviation (see lines 677 and 693)? I think it's sufficient to just put this in the figure/text.
* Are there particular domains/classifiers that the model does poorly on (essentially, what information is being lost due to the aggregation across domains presented in Figures 4/5)?
* Is the 10% of the data mentioned in line 712 the same as what was used for Figures 4/5?
* The experiments are otherwise very exciting. I am a little bit concerned about why there isn't a more significant difference between the methods on the Brain dataset, given it has an order of magnitude less data than the NELL dataset (I guess this is because it has far fewer features and is an "easier" task?).
* As an aside, I think it would be very exciting to apply your model to the problem of calibration, which is potentially even more crucial to building autonomous systems.
* I would be willing to revise my rating to a strong accept if the above details are clarified in the rebuttal.

=====
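Sketch referenced in Review #4's independence question (my own illustration, not the paper's method and not Platanios et al.'s algorithm; it assumes three binary classifiers that are better than chance, with symmetric error rates and errors independent given the true label): under exactly these assumptions, pairwise agreement rates alone pin down the individual error rates, which is why correlated errors are the natural failure mode.

    # Illustration only: with errors independent given the true label and
    # symmetric across classes, agreement rates a_ij = (1-e_i)(1-e_j) + e_i*e_j
    # determine the error rates of three better-than-chance classifiers.
    import numpy as np

    def errors_from_agreements(a12, a13, a23):
        c12, c13, c23 = 2 * a12 - 1, 2 * a13 - 1, 2 * a23 - 1  # c_ij = d_i * d_j
        d1 = np.sqrt(c12 * c13 / c23)                          # d_i = 1 - 2 * e_i
        d2 = np.sqrt(c12 * c23 / c13)
        d3 = np.sqrt(c13 * c23 / c12)
        return (1 - d1) / 2, (1 - d2) / 2, (1 - d3) / 2

    e = (0.10, 0.20, 0.30)                                     # ground-truth error rates
    a = lambda i, j: (1 - e[i]) * (1 - e[j]) + e[i] * e[j]     # exact agreement rates
    print(errors_from_agreements(a(0, 1), a(0, 2), a(1, 2)))   # recovers ~(0.10, 0.20, 0.30)
    # If errors are correlated, the same equations return biased estimates,
    # which is why the independence question above matters for the paper.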