Paper ID: 387
Title: Mixing Rates for the Alternating Gibbs Sampler over Restricted Boltzmann Machines and Friends

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): This paper looks at mixing times for the alternating       Gibbs sampler 

Clarity - Justification: The paper is quite technical, but effort has been       made to simplify as far as possible

Significance - Justification: I recommend acceptance. I think the mixing time bounds       specifically for discrete RBMs aren't *that* novel as they are       basically special cases of known results (albeit for block       updates rather than random updates). Still the proof technique       is interesting, the generalizations to continuous spaces later       on are interesting, and the lower bounds are interesting. The       paper is well written. There are no experiments, but I see no       need for them. (ICML is approprate for new analysis of existing       algorithms, and I'm frankly glad to see work that doesn't see       the need to introduce a new variant.)

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):   I think that the specific results for RBMs in Section 3 can be seen as special cases of Liu 2014 "Projecting MRFs for fast mixing"  for random updates. Specifically, using Theorems 4 and 6 from that paper and applying them to RBMs seems to give the mixing rate for RBMs on Cor. 5, the mixing rate for deep RBMs on Cor. 6 and the mixing rate for softmax RBMs in Cor. 8. Still, there is novelty in novelty in considering block updates rather than random updates, and the proof technique in Theorem 3 based on two contractive block updates may be of independent interest. Still, it's worth asking 1) if the mixing times obtained by the proof technique in this submission can be extended to other matrix norms and 2) given that essentially the same mixing time bound holds for both random updates and block updates, why are block updates seemingly more effective in practice? A reader might get the impression that block updates have better bounds than random updates which isn't true (at this time).  In section 4, I found both the notion of a "gamble admissive" distribution and the idea of "coupling when Y is started from the stationary distribution and X is at most distance M from Y in expecation" explored on line 608-620 interesting and useful. For the latter, I would suggest giving this concept some kind of name, e.g. "mixing from stationary time" or "re-mixing time", and introducing a notation like ${\tilde tau}_M(\epsilon)$ to denote it. This would give a clearer contrast between the result in Thm 11 and a standard mixing time. With this done, it would be interesting to ask, can one have a markov chain which has a fast "re-mixing" time but a slow standard mixing time? The text on lines 608-613 suggest this is so, but I think this is worth formally stating as a Lemma, and also perhaps stating that a fast re-mixing time *does* imply a fast mixing time when the state space is bounded.  Is there any intuition for why the bound in Cor 13 depends on the Frobenius norm of W rather the 1-norm as in earlier results? I find this rather surprising, since the Frobenius norm will tend to grow as the dimensionality of the models grows even assuming constant parameters and sparsity, while the 1 norm will not. I was *quite* surprised to see Thm 16 which has a lower-bound on mixing time that establishes this is not just an artifact of the analysis.  Some discussion of how Thm 15 compares to the earlier upper-bounds would also be useful.  Overall, I thought this paper was interesting and (as far as I can tell) correct so I recommend acceptance.       

=====

Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): This paper provides mixing rates for Gibbs samplers on Markov random fields (MRFs) that exhibit bipartite structure. Important examples of such models include various Boltzmann machines (BM) including restricted Boltzmann machines (RBMS), deep Boltzmann machines, Gaussian RBMs, and others. 

Clarity - Justification: The authors are careful to provide precise definitions for all the relevant functions and concepts. I would have given a higher score if the significance of the mixing rate results were more thoroughly discussed and perhaps compared with existing work (see below).  I also have a more specific question about the proof of Lemma 4 in the appendix. The first step is confusing (Line 1027), what rule is being applied? Shouldn't there be an inequality instead of an equality? I.e., that E|x-y| <= E(|x|) - E(|y|)? Does this inequality indeed hold?

Significance - Justification: Since alternating Gibbs samplers are the dominant inference method in RBMs, it is important to understand their mixing rates. In practice, these methods converge relatively fast, and indeed this may be why contrastive-divergence-1 works so well. These results demonstrate a type of uniform ergodicity in the sense that the mixing rate remarkably does not depend on the initial state.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): The mixing rates are quite general and for a highly relevant class of MRFs. Most of the assumptions are reasonable, but not always. For example, in Corollary 8, it is unlikely that inequality would be satisfied as the LHS depends not only on the norms of the weights, but quadratically on K, which for the replicated soft max topic model K might exceed millions of vocabulary words when applied to say, Wikipedia.  The paper could be further strengthened by mentioning and comparing with existing work on analyzing convergence of MCMC algorithms. For example, it might be worth mentioning how the theoretical results relate to existing work on analyzing checkerboard Gibbs samplers in spin-glass models and more general chromatic samplers, both of which overlap substantially with the studied class of models+Gibbs samplers.  

=====

Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): The paper establishes bounds on mixing times for two-stage Gibbs samplers on both discrete and continuous state spaces by using coupling constructions. The discrete case is quite straightforward and the main contributions of the paper are the results for the continuous case. Applications are for various Restricted Boltzmann Machines.

Clarity - Justification: The paper is easy to read but is lacking an in-depth discussion about the conditions of the presented theorems.

Significance - Justification: The results are quite incremental over existing convergence results for Gibbs samplers and they are also of limited applicability (essentially applying only to variants of RBMs)

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): The paper investigates the mixing rates of Gibbs samplers for Restricted Boltzmann Machines and various extensions/modifications of the RBM. This analysis is based on coupling constructions.  Section 3 of the paper is devoted to discrete models. The results derived here are quite straightforward and obvious. The Gibbs kernel is assumed to contract wrt a semimetric, and the TV distance can then easily be bounded by a quantity \propto 1/ \gamma^{(min)}. The conditions of the theorem are then verified for the RBM which gives a contraction if |W|_1 *|W^t|_1 < 4. I wonder, however, how realistic this condition is for RBMs used in practice and a more in-depth discussion about this would have been a nice addition to the paper.  Section 4 continues with an analysis of continuous unbounded state spaces. The analysis is again based on coupling arguments under contraction assumptions, but the combination of different couplings used in Lemma 10 seems to be non-standard and is quite interesting! In my opinion, the contribution of the paper is therefore mainly Section 4 (worth to note is that this section is irrelevant for the standard RBM). Furthermore, as pointed out by the authors in Section 1.3, the continuous problem is not as well explored as the discrete one. However, Section 1.3 does not acknowledge any prior work on the analysis of Gibbs samplers for general state spaces, even though such works do exist; see for instance   Tan et al., 2013, On the Geometric Ergodicity of Two-Variable Gibbs Samplers. Wang and Wu, 2014, Convergence rate and concentration inequalities for Gibbs sampling in high dimension Singh et al., 2015, Blocking Strategies and Stability of Particle Gibbs Samplers  Finally, Sections 5 and 6, respectively, establishes lower bounds on the mixing rate and summarizes some results on the complexity of Gibbs sampling for RBMs under various conditions on the weight matrix. As mentioned above, I think that a deeper discussion about the conditions on W would be of interest, both to get a feeling for how realistic the condition on W is (Thm 17(ii)) and what we can expect from the Gibbs sampler when the condition is not satisfied.

=====