Paper ID: 733
Title: Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs Sampling
Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): This paper is concerned with "Hogwild"-style Gibbs sampling, modeled by saying that when variable $x_i$ is updated, it may see a "stale" version of its neighbors. The paper bounds both the bias (how much error the staleness causes) and, separately, the mixing time (how fast the chain converges to that error-including distribution).
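To make the stale-read model concrete, here is a minimal Python sketch (my own illustration, not the authors' code) of HOGWILD!-style Gibbs on a small Ising chain. Each update reads neighbor values from a snapshot that is `delay` sweeps old; `delay=0` recovers ordinary sequential-scan Gibbs. The chain length `n`, coupling `beta`, and the whole-sweep delay model are illustration-only assumptions, not the paper's exact setup:

```python
import math
import random

def gibbs_ising_chain(n, beta, n_iter, delay=0, seed=0):
    """Toy simulation of HOGWILD!-style Gibbs on an Ising chain.

    With delay=0 each update reads the current state (sequential Gibbs);
    with delay=t each update reads a snapshot of the state that is t full
    sweeps old, mimicking stale asynchronous reads.  Returns the empirical
    marginals P(x_i = +1).
    """
    rng = random.Random(seed)
    x = [rng.choice([-1, 1]) for _ in range(n)]
    snapshots = [list(x)]          # history of the state after each sweep
    counts = [0] * n               # running count of sweeps with x_i == +1
    for _ in range(n_iter):
        # Neighbor reads come from a stale snapshot (or the live state).
        stale = x if delay == 0 else snapshots[max(0, len(snapshots) - delay)]
        for i in range(n):
            field = beta * (stale[i - 1] if i > 0 else 0) \
                  + beta * (stale[i + 1] if i < n - 1 else 0)
            # Conditional P(x_i = +1 | neighbors) for the Ising model.
            p_plus = 1.0 / (1.0 + math.exp(-2.0 * field))
            x[i] = 1 if rng.random() < p_plus else -1
            if x[i] == 1:
                counts[i] += 1
        snapshots.append(list(x))
    return [c / n_iter for c in counts]
```

Comparing the returned marginals for `delay=0` against `delay>0` is exactly the kind of simulated-delay experiment one could run to probe the bias the paper analyzes.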
Clarity - Justification: A pretty technical paper, but well organized and written.
Significance - Justification: The practical significance isn't so much demonstrated by the experiments, but I feel this kind of work is fundamental and important. There are some ideas that are of independent interest as well.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): Overall, I think this is good work on an important topic, and should be accepted. I have some disconnected thoughts:
- Firstly, the idea of a sparse estimation time, and the fact that regular synchronous Gibbs achieves an O(n) rate on it rather than the O(n log n) rate common with regular mixing times, is independently interesting, and I think the paper should highlight this a bit further.
- On first reading, I was hugely confused by Section 5 until I realized that $\bar \pi$ is the *biased* distribution, and that when comparing sequential Gibbs to Hogwild Gibbs it will change. (At least I think this is true.) This should be emphasized a bit more for clarity.
- On a related note, I wasn't initially sure what the value of Section 5 is. After all, in practice, we care about getting something close to the true distribution. But it's actually quite surprising that Hogwild Gibbs can have a bias and also *converge to that biased distribution more slowly*.
- Also in Section 5, if my understanding of $\bar \pi$ above is correct, this means it depends on the nature of the delays. Does this mean that you must assume the delay conditions on the machine are stationary in order to derive any of these results? Isn't this very unrealistic in practice? (For example, other jobs will start and end on the same machine.)
- In the appendix, isn't Lemma 2 the definition of alpha, rather than a lemma?
- In the appendix, Lemma 5 is interesting in its own right. I find it to give better insight into the "accuracy barrier" than Theorem 1. (Theorem 1 is strange since it doesn't blow up as epsilon approaches the "wall" of accuracy as defined on line 497. Does this mean there is looseness?)
- On lines 788-795, you should mention that you can only do this for attractive models.
- Very nitpicky, but on line 101, I didn't see why this would be surprising: intuitively, surely using stale data to do sampling would create errors?
- In Fig. 5, I would like some more explanation of what multi-model Gibbs is.
- For the experiments, it might be worth running some more simulated experiments to explore variations in the distributions of delays.

Overall, a good paper. The experimental results aren't hugely compelling, and there are some places where the analysis seems like it might be tightened, but I think it is well into accept territory anyway.
=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): The paper presents clear theoretical results on the (marginal) bias and mixing time of Hogwild-style Gibbs sampling for a class of discrete distributions. The analysis is very well presented and clearly motivated with good examples. My major concern is that the theoretical claims could have been better supported by more thorough experimental studies, which are fairly lacking currently (see detailed comments below).
Clarity - Justification: Very good. I appreciate the way the authors present their theoretical analyses so that even though readers (like me) cannot verify details in the proofs they can still put together the results and grasp the big picture.
Significance - Justification: The presented analyses clearly outlined two crucial properties of Hogwild-style Gibbs sampling on discrete graphical models, namely its bias and mixing time, and the results are thus of considerable importance. The intuition of introducing sparse variation distance to measure marginal bias is especially interesting to me. However, one could argue that in many cases (like the low-degree Ising model used in the experiment), these types of models naturally contain a fair amount of local conditional independencies that one could already exploit to develop highly parallel Gibbs sampling algorithms. The benefit of using Hogwild-style Gibbs sampling therefore appears less compelling.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
1. The counter-example in Sec. 5.1 seems to be of very high degree (O(N)), while both the theoretical result (Dobrushin's condition < 1) and the experiment (degree = O(1)) seem to imply that a fairly low degree is "necessary" for "fast" mixing, leaving a big gap in between?
2. The experimental study could have included results on the bias of the marginals, e.g. a graph showing the decay of the sparse variation distance with the number of iterations, a graph similar to Fig. 4 for the bias, etc., to better support the analyses of bias.
3. How many threads are used in Fig. 4?
4. It might be helpful to show (even empirically) how changes in the number of threads affect the delay parameter, which is arguably the only factor (?) in the theoretical results that reflects concurrency. For a while I found it hard to believe that bias and mixing time should have nothing to do with the number of threads, before I finally realized this.
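For readers' reference, the Dobrushin's condition mentioned above is usually stated as follows (this is the standard textbook form; the paper's exact formulation may differ in details): the total influence must satisfy

\[
\alpha \;=\; \max_{i}\, \sum_{j \ne i} \alpha_{ij} \;<\; 1,
\qquad
\alpha_{ij} \;=\; \max_{\substack{x,\,y \text{ differing} \\ \text{only at } j}}
\bigl\| \pi(X_i \in \cdot \mid X_{-i} = x_{-i}) \;-\; \pi(X_i \in \cdot \mid X_{-i} = y_{-i}) \bigr\|_{\mathrm{TV}},
\]

where $\alpha_{ij}$ measures how much changing variable $j$ alone can move the conditional distribution of variable $i$. Low degree makes the condition easier to satisfy because each row sum has few nonzero terms, which is consistent with the gap the comment above points out.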
=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): This article studies the theoretical aspects of the asynchronous Gibbs sampler, also known as the HOGWILD!-Gibbs sampler. The authors theoretically analyze the estimation bias and the poor mixing caused by asynchronous updates. For bias, the analysis illustrates the difficulty of HOGWILD!-Gibbs converging in total variation distance and defines a new sparse total variation distance to characterize the asymptotic behavior. For mixing time, the analysis shows that the growth rate can be as bad as exponential. On a positive note, the article proves that if the model satisfies Dobrushin's condition, the HOGWILD!-Gibbs sampler can achieve low estimation bias with a fast mixing rate. Experiments are provided to verify the theoretical claims.
Clarity - Justification: The presentation is clear and concise.
Significance - Justification: The theoretical analysis of asynchronous Gibbs sampling is important, as asynchronous sampling/optimization algorithms have gained vast popularity in recent years. Such analysis sheds light on the theoretical limits of the algorithm, helps people understand its applicable range, and prevents possible abuse beyond the algorithm's abilities. I'm quite interested in seeing how the naive asynchronous update could be modified, in light of the current work, to gain a better mixing rate beyond Dobrushin's condition.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): This is an interesting piece of work backed up by solid theoretical analysis. The results enrich people's understanding of a popular scalable algorithm and give insight into possible future improvements.
=====