Paper ID: 272
Title: Optimality of Belief Propagation for Crowdsourced Classification

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

The paper deals with the problem of inferring true class labels from noisy crowdsourced labels. The main contribution of the paper is to formally prove that (loopy) Belief Propagation is optimal for this inference task, in the limit where the number of tasks and workers goes to infinity (Thm 1). The main limitation is that the result only applies to graphs where each task is assigned to either 1 or 2 randomly chosen workers. If more than 2 workers are assigned to a task, the result does not hold, but the authors prove that BP is still guaranteed to outperform two competing methods, namely majority voting (MV) and KOS (Thm 2). The main idea of their proofs is to show that under these assumptions the underlying graphical model is tree-like, and therefore loopy BP will provide accurate answers.

Clarity - Justification:

The paper is well structured and written. It was easy to read, and the authors did a good job providing the intuition behind their formal results.

Significance - Justification:

I found the results interesting, as they provide a nice theoretical guarantee for loopy BP on a real-world inference problem. The assumptions made, however, are somewhat unrealistic (each task assigned to 2 randomly chosen workers). Another limitation is that the paper does not really introduce any new algorithm or provide new insights on how to develop better techniques. The experimental part does not add much to the paper.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

A few more comments on the exposition:
** Line 122: what does "exactly optimal" mean?
** A figure showing the graphs introduced around lines 530 to 545 would be helpful.
** Line 588: to go from (14) to (15), are you assuming that if the graph is not a tree, BP has a worst-case probability of error equal to 1?

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

This paper considers the Dawid-Skene model for crowdsourced classification, and as its main result it shows that when every worker has 2 assigned tasks, and every task is treated by a sufficient number of workers, the belief propagation algorithm for this problem is optimal. This is an important and mathematically nice contribution, adding this popular and important problem to the list of models where BP is known to be optimal in some regimes. As concerns the discussion of the regimes that are not covered by the proof, e.g. larger r or smaller l, the paper's evidence for the optimality of BP is relatively weak, and I think the authors should weaken their conclusions.

Clarity - Justification:

The main result (Theorem 1) is very clear. What I find misleading/unclear are the parts where BP performance outside the assumptions of Theorem 1 is discussed. The evidence for those claims is based on simulations of systems with only 200 tasks, which is totally inconclusive. I do understand that a rigorous extension of Theorem 1 is difficult, but there exist well-established heuristic arguments that lead to much more solid conjectures of optimality (or suboptimality) of BP in related models, such as the stochastic block model (Decelle et al., "Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications").

Significance - Justification:

Theorem 1 on its own, plus the algorithm and its promising performance on real datasets, are significant. (Although the sizes of the datasets on which the authors demonstrate the algorithm are ridiculously small: given that the time complexity is only n*l*r, can they not take millions of nodes instead of a few hundred?)
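To make the scale point concrete, here is the kind of experiment I have in mind: a minimal sketch that generates a synthetic one-coin Dawid-Skene instance with a million tasks and runs the majority-vote baseline. Everything here (the assignment scheme, the reliability prior, all names and parameters) is my own illustration, not the authors' setup; the point is only that instances of this size are cheap to handle.

```python
# Synthetic one-coin Dawid-Skene instance at large scale (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

n, l, r = 1_000_000, 5, 2             # tasks, workers per task, tasks per worker
m = n * l // r                        # number of workers in a regular assignment

s = rng.choice([-1, 1], size=n)       # hidden true labels
p = rng.uniform(0.6, 0.9, size=m)     # hidden worker reliabilities

# Random assignment: each task gets l answer slots, each worker fills r of them.
# (A worker may occasionally hit the same task twice; fine for a sketch.)
task_of = np.repeat(np.arange(n), l)
worker_of = rng.permutation(np.repeat(np.arange(m), r))

# Each worker answers a task correctly with probability p_u.
correct = rng.random(n * l) < p[worker_of]
answers = np.where(correct, s[task_of], -s[task_of])

# Majority-vote baseline: O(n*l) work, so a million tasks take seconds.
votes = np.zeros(n)
np.add.at(votes, task_of, answers)
s_hat = np.where(votes >= 0, 1, -1)   # l is odd here, so no ties occur
print("majority-vote error:", np.mean(s_hat != s))
```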
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

** When the authors state their main result on page two, it would be more than fair to add the restriction on the parameters under which the result holds.

** The authors only consider the labels to be +1 or -1 with equal probability. Why is this without loss of generality, as claimed? If -1 were very rare, for instance, would that not lead to different performance?

** The technique developed in this paper is indeed interesting for the censored block model, but not for the regime of exact recovery as considered in the cited reference. It would be interesting for the low-SNR regime as considered by Saade et al. in "Spectral detection in the censored block model".

** In the discussion of r greater than 2 in Section 3.1, the authors refer to the stochastic block model with more than two groups. Whereas I agree that mathematically the challenges there are related, the SBM with more than 3 groups is heuristically well understood, and, opposite to the authors' claims for the present model, in the SBM there provably is a gap between the performance of BP and the information-theoretically optimal performance for larger numbers of groups (this is studied in detail in the above-mentioned Decelle et al. paper, and was recently proven by Banks and Moore, "Information-theoretic thresholds for community detection in sparse networks").

** Concerning the numerical tests: what prevents the authors from taking more realistic system sizes? 200 tasks is really tiny.

** The choice of oracle for BP as done in the present paper is not very standard. The usual criterion is the existence of another BP fixed point when BP is initialized according to the ground-truth labeling (see Decelle et al. for this in the SBM). If BP and this ground-truth-initialized oracle were run on much larger system sizes and a wide range of parameters (typically, gaps appear at large r in regions of small but non-zero error rate), this would provide more solid evidence of the optimality of BP; a minimal sketch of such a diagnostic follows these comments.

** In the conclusions, "we settle the question of optimality and computational gap ..." is overoptimistic. The authors settle this question only in a rather narrow range of parameters. This is interesting; there is no need for possibly misleading claims.
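To be concrete, here is a minimal sketch of that fixed-point diagnostic, assuming a one-coin Dawid-Skene model with a two-point (spammer/hammer) reliability prior. The updates are the standard sum-product messages for this bipartite model, not necessarily the authors' exact implementation, and all names and parameter values are illustrative:

```python
# Fixed-point diagnostic: run BP from an uninformed initialization and from an
# initialization clamped to the ground truth, then compare the two fixed points.
# One-coin Dawid-Skene, two-point reliability prior; illustrative only.
import numpy as np

rng = np.random.default_rng(1)

n, l, r = 2000, 5, 10                        # tasks, workers per task, tasks per worker
m = n * l // r
q_good, q_bad, prior_good = 0.85, 0.5, 0.5   # "hammer"/"spammer" reliabilities

s = rng.choice([-1, 1], size=n)              # hidden true labels
p = np.where(rng.random(m) < prior_good, q_good, q_bad)

task_of = np.repeat(np.arange(n), l)
worker_of = rng.permutation(np.repeat(np.arange(m), r))
ans = np.where(rng.random(n * l) < p[worker_of], s[task_of], -s[task_of])

edges_of_task = [[] for _ in range(n)]
edges_of_worker = [[] for _ in range(m)]
for e in range(n * l):
    edges_of_task[task_of[e]].append(e)
    edges_of_worker[worker_of[e]].append(e)

def g(a, sign, q):        # P(answer a | true label sign, worker reliability q)
    return q if a == sign else 1.0 - q

def run_bp(nu0, iters=30):
    nu = nu0.copy()                    # nu[e]: task->worker message, P(s_i = +1)
    mu = np.full(n * l, prior_good)    # mu[e]: worker->task message, P(p_u = q_good)
    for _ in range(iters):
        for u in range(m):             # worker -> task updates
            for e in edges_of_worker[u]:
                pg, pb = prior_good, 1.0 - prior_good
                for e2 in edges_of_worker[u]:
                    if e2 == e:
                        continue
                    a = ans[e2]
                    pg *= nu[e2] * g(a, 1, q_good) + (1 - nu[e2]) * g(a, -1, q_good)
                    pb *= nu[e2] * g(a, 1, q_bad) + (1 - nu[e2]) * g(a, -1, q_bad)
                mu[e] = pg / (pg + pb)
        for i in range(n):             # task -> worker updates
            for e in edges_of_task[i]:
                wp = wm = 1.0
                for e2 in edges_of_task[i]:
                    if e2 == e:
                        continue
                    a = ans[e2]
                    wp *= mu[e2] * g(a, 1, q_good) + (1 - mu[e2]) * g(a, 1, q_bad)
                    wm *= mu[e2] * g(a, -1, q_good) + (1 - mu[e2]) * g(a, -1, q_bad)
                nu[e] = wp / (wp + wm)
    belief = np.empty(n)               # marginals P(s_i = +1)
    for i in range(n):
        wp = wm = 1.0
        for e in edges_of_task[i]:
            a = ans[e]
            wp *= mu[e] * g(a, 1, q_good) + (1 - mu[e]) * g(a, 1, q_bad)
            wm *= mu[e] * g(a, -1, q_good) + (1 - mu[e]) * g(a, -1, q_bad)
        belief[i] = wp / (wp + wm)
    return belief

b_uninf = run_bp(np.full(n * l, 0.5))                # uninformed start
b_truth = run_bp((s[task_of] > 0).astype(float))     # start clamped to the truth
for name, b in [("uninformed init", b_uninf), ("ground-truth init", b_truth)]:
    print(name, "error:", np.mean(np.where(b >= 0.5, 1, -1) != s))
# A persistent accuracy gap between the two fixed points (over many seeds and
# parameter settings) would be evidence that uninformed BP is suboptimal there.
```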
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

Crowdsourced classification is the problem of inferring a true label from the tags given by $l$ workers. The paper focuses on the setting where each worker has a fixed unknown credibility $p_u$ (the same for all $r$ assigned tasks). BP is suggested as the inference algorithm, and theoretical guarantees are presented showing that BP outperforms majority vote and KOS. Synthetic and real-world experiments verify that BP outperforms the other two methods.

Clarity - Justification:

The writing is good, though it is not clear from the reading where parts of the conditions come from. Where do you use Lemma 3?

Significance - Justification:

The theorems presented are another interesting example where BP performs well. However, all the proofs hold only in the limit as n goes to infinity; no bounds for finite n are given.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

The graphical model that is used must be made clear; an image would be great.

I did not understand what you mean by BP vs. practical BP: if you know p_u for each worker, the graph becomes a forest where each s_i is connected to the A_iu, and these are connected to the observed workers' p_u (where is my mistake?). This leads to the use of EBP: how do you choose $k$? Please explain the trade-off between a more accurate assessment of p_u and the inclusion of cycles that will degrade BP performance.

MAP is usually used when looking for the maximum assignment of all variables together (max p(s|A)); here the marginals are needed, i.e., taking the maximum of p(s_i|A) for each s_i separately.

In the experiment plots you show EM, with no reference to it in the text.
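Presumably the EM curves refer to the classic Dawid-Skene estimator; for reference, here is a minimal sketch of its one-coin variant (my own illustrative reconstruction, not the authors' code), alternating posterior label marginals with reliability re-estimates:

```python
# One-coin Dawid-Skene EM baseline (illustrative reconstruction, not the
# authors' code). Inputs match the arrays built in the earlier sketches:
# task_of[e], worker_of[e], ans[e] in {-1,+1} describe the e-th answer.
import numpy as np

def em_one_coin(task_of, worker_of, ans, n, m, iters=50):
    p_hat = np.full(m, 0.7)      # reliability estimates; >0.5 breaks the symmetry
    q = np.full(n, 0.5)          # posterior P(s_i = +1)
    for _ in range(iters):
        # E-step: label posteriors given reliabilities (uniform label prior).
        llr = np.zeros(n)        # log-likelihood ratio of s_i = +1 vs -1
        np.add.at(llr, task_of,
                  ans * np.log(p_hat[worker_of] / (1 - p_hat[worker_of])))
        q = 1.0 / (1.0 + np.exp(-llr))
        # M-step: reliability = expected fraction of correct answers per worker.
        prob_correct = np.where(ans == 1, q[task_of], 1 - q[task_of])
        num, cnt = np.zeros(m), np.zeros(m)
        np.add.at(num, worker_of, prob_correct)
        np.add.at(cnt, worker_of, 1.0)
        p_hat = np.clip(num / np.maximum(cnt, 1.0), 0.01, 0.99)
    return np.where(q >= 0.5, 1, -1), p_hat

# e.g., on the synthetic instance from the BP sketch above:
# s_hat, p_hat = em_one_coin(task_of, worker_of, ans, n, m)
```

=====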