Paper ID: 328 Title: Community Recovery in Graphs with Locality Review #1 ===== Summary of the paper (Summarize the main claims/contributions of the paper.): This paper considers a version of the community detection problem where pairwise measurements are available only on the neighborhood of each node defined by an underlying graph. Sharp recovery boundaries are derived for the number of measurements required for exact cluster recovery, for certain types of graphs and provided that the neighborhood is sufficiently large. The algorithm consist of spectral clustering of a small subset of nodes followed by successive propagation of the cluster labels via majority voting. Clarity - Justification: The paper is in general easy to follow, with the key ideas clearly presented. I do hope the authors are a bit more explicit with constants and asymptotic arguments, e.g. in phases like "with high probability", "vanishingly small", as the proof is a combination of asymptotic and non-asymptotic analysis, and the paper does care about constants. The supplementary material seems to be prepared in a hurry. There are various cross-referencing errors, an editing comment at the beginning of Sec 4.3.1, and sentence like "Z is independent of Z” in Remark 3. Significance - Justification: While the algorithm and analysis is relatively standard, I do like the problem setup and models in this paper, which departs from the usual uniform sampling model. The results show that as long as the underlying graph is sufficiently connected and a small number of nodes can be accurately clustered, then the information can be propagated and the sample complexity is largely independent of the graph topology (at least for certain types of graphs). I think this is an interesting observation and useful first step to more general problems. Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): The results in this paper are valid in the regime where the connectivity radius of the underlying graph is sufficiently large: r > log^3 n. With such dense graphs, the topology/locality constraint essentially plays no role in determining the hardness cluster recovery (besides some boundary effects), especially for the simple types of graph considered here. It may be more interesting to consider sparser and more complicated graphs, and study the effect of graph topology in a more general setting. The algorithm relies on an pre-specified ordering of the nodes, and it is not quite clear to me how this order is defined for the grid graphs (or other more general graphs). The recovering procedure and analysis seem to be very much tied to such structures, and I am not sure how they can be extended to more general settings. Even restricted to the grid and small world graphs considered here, how are the sets V_0 and V_i defined? And what happens if the graph topology is unknown (i.e., one knows only those i,j with N_{ij} >0 but not the full underlying graph.) The analysis of stage 2 of the algorithm seems to be a key of the proof, and it would benefit from some improvements in the presentation. For example, the definition of the set S_u^good precludes nodes in the same group V_{i+1}. As far as I understand, this seems to be important in the subsequent analysis that uses conditional independence across V_{i+1}. It would be helpful to elaborate on key steps like this, as well as what is the event that is conditioned on. Finally, the entire analysis is based on the boundary case with m ~ n log(n). I am not completely convinced that it can be easily extended to the general case. For example, the analysis relies on the results from Chin et al, which in its current form is valid only in the boundary case. ===== Review #2 ===== Summary of the paper (Summarize the main claims/contributions of the paper.): The authors consider community recovery in graphs with locality constraints. That is, the goal is to recover the community structure of a graph when (noisy) "co-habitation" information is constrained by some locality constraints on the vertices. This setting seems to be motivated mainly by the haplotype phasing problem. The authors demonstrate a natural algorithm for exact recovery and show theoretical guarantees for specific types of locality constraints. Clarity - Justification: The paper is written very clearly. I would however recommend at least a short paragraph in the introduction sketching out the haplotype phasing problem to motivate the problem early on. Significance - Justification: The problem considered is quite interesting and the algorithm is natural in this setting. Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): The authors study an interesting and seemingly well motivated problem. The paper is very well written. I recommend that it be accepted to ICML. I somewhat strongly urge the authors to include brief proof sketches in the paper in the spirit of making it as self-contained as possible. Also, it will be interesting to see discussion about future work/open problems spelt out since the setting seems quite specialized. ===== Review #3 ===== Summary of the paper (Summarize the main claims/contributions of the paper.): This paper studies community recovery for a random graph, where each pair of nodes is associated with an independent Poisson random variable. The special setting is that not all edges are observed. Under a special two-class symmetric model, the author(s) extended the existing theory of sample complexity to this incomplete observation model. Clarity - Justification: The paper is well organized, with clear notation. Significance - Justification: Although the setting is somewhat new, the mains results are largely based on known results for fully observed graph, with routine modifications of previous arguments. Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): On the theoretical side, the author should emphasize/clarify the connection/difference of the technics used in this paper and in previous works on fully observed graphs. At this point it is hard for me to see technical novelty. The author(s) did mention haplotype phasing problem as a motivation, which seems interesting. However, in the numerical experiment section only synthetic data is considered. It would probably be more challenging to consider settings where the locality constraint does not allow any core set of fully observed nodes. =====