Paper ID: 754
Title: On collapsed representation of hierarchical Completely Random Measures
Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): This paper proposes a formulation of hierarchical completely random measures as priors for Poisson processes, and describes a collapsed sampler akin to the Chinese restaurant franchise technique. The sampler is applied to topic modeling using a generalized Gamma process-based hierarchy. Experimental results are provided for the NIPS dataset. Though the paper is in general well written, there are multiple issues with the technical content and the experimental section, as detailed below. Overall, I consider this paper in its present state to be well below the acceptance threshold.
Clarity - Justification: The paper does a good job of discussing the salient points of the work. There are some minor typographic errors, as mentioned below, but they do not detract from the overall presentation of the material.
Significance - Justification: The paper proposes a hierarchical CRM formulation by trying to marginalize out the intermediate measures, but the authors do not include the important parts of the proof in the main paper. The correctness of the CRF-inspired sampling technique is not proven, and the experimental section is extremely short and nondescript. The perplexity numbers seem to be well below the state of the art, even for LDA and HDP. I do not think the paper in its present state will be of sufficient interest to the community.

Update
-------
The rebuttal has addressed the main issues I had with the paper, so I am upgrading the significance to above average.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
- The main theoretical arguments in the paper are contained in Theorem 2 and Proposition 1. Aside from the numbering anomalies, I wonder why these important sections were relegated to the supplementary material despite there being enough space in the main paper.
- In the proof of Theorem 2, how do the authors arrive at the expression on lines 842-844? In particular, how does the conditional probability of the $N_i$s decompose into the two product terms?
- In the proof of Theorem 2, I am not convinced of the correctness of the derivation of the distribution of the $m_{ij}$s from that of the $N_i$s by simply adding up the probabilities. The CDF technique would be better suited for this step.
- The Chinese restaurant process/franchise techniques arise from collapsing out certain sampling steps for DP / HDP mixtures. Now it is well known that the DP is not a CRM, but a normalized one. Can the authors' sampler be similarly derived? This question is pertinent to the correctness of the proposed sampler.
- The experimental section is very short and uninformative. Why weren't more datasets used, and more algorithms compared against?
- The perplexity numbers for the NIPS dataset are worse than even the earliest ones reported in the literature using LDA/HDP (e.g., Asuncion et al., NIPS 2008). Yet the authors claim on lines 653-658 that the models are identical. This discrepancy leads me to question the correctness of the sampler.

Minor comments
--------------
- Typographic errors: line numbers 256, 264.
- Other writing anomalies: empty bullet on line 135.
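For reference on the collapsing alluded to above: after integrating out the DP, cluster assignments follow the familiar CRP seating rule. A minimal sketch (my own illustration with made-up names, not the authors' sampler; the paper's CRM case would replace these weights with ones derived from the Lévy measure):

```python
import random

def crp_assign(counts, alpha):
    """Return a table index for the next customer under CRP(alpha);
    len(counts) means 'open a new table'."""
    total = sum(counts) + alpha
    r = random.uniform(0.0, total)
    acc = 0.0
    for k, c in enumerate(counts):
        acc += c                    # existing table k has weight n_k
        if r <= acc:
            return k
    return len(counts)              # remaining mass alpha -> new table

random.seed(0)
tables = []                         # tables[k] = number of customers at table k
for _ in range(100):
    k = crp_assign(tables, alpha=1.0)
    if k == len(tables):
        tables.append(1)
    else:
        tables[k] += 1
# sum(tables) == 100; len(tables) is the number of occupied tables
```

The point of the review question is whether an analogous exchangeable seating rule can be justified when the underlying measure is an unnormalized CRM rather than a DP.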
=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): The paper derives a collapsed Gibbs sampler, related to the Chinese restaurant franchise sampling scheme, for mixed-membership models built on hierarchical CRM-CRM-Poisson processes. The paper appears to be theoretically solid but somewhat weak in experiments.
Clarity - Justification: The paper provides sufficient details for all derivations. It could be further improved by explicitly writing out the sampling equations for the gamma process, generalized gamma process, and sum-generalized gamma process. There is not enough analysis of the experimental results.
Significance - Justification: While some of the developments are closely related to previous work, Lemma 2 and the collapsed Gibbs sampler described in Equations (11)-(14) appear to be novel. The SGGM (sum of generalized gamma processes) also seems interesting, given its superior performance in comparison to the GGM (generalized gamma process).
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): Directly marginalizing out both $G$ and $\Lambda_i$ from a CRM-CRM-Poisson process $X_i \sim \mathrm{PP}(\Lambda_i)$, $\Lambda_i \sim \mathrm{CRM}(\rho', G)$, $G \sim \mathrm{CRM}(\rho, G_0)$ may not be easy, except in some special cases (e.g., the $\Lambda_i$ are drawn from gamma processes, and $G$ is drawn from the beta process or gamma process). To address that challenge, the key idea of the paper is to re-express a CRM mixed Poisson process $X_i \sim \mathrm{PP}(\Lambda_i)$, $\Lambda_i \sim \mathrm{CRM}(\rho', G)$ as a compound Poisson process $X_i \sim \mathrm{CompoundPoisson}(L_i, \rho')$, $L_i \sim \mathrm{PP}[G f(\rho')]$, and then marginalize out the globally shared CRM $G \sim \mathrm{CRM}(\rho, G_0)$ from the $n$ conditionally i.i.d. Poisson processes $L_i \sim \mathrm{PP}(G f(\rho'))$, $i = 1, \ldots, n$. To derive a sampling algorithm that does not require truncation, the paper develops a collapsed Gibbs sampler that mimics the Chinese restaurant franchise sampling scheme.

While some of the developments are closely related to those in [Zhou and Carin, 2015], [Zhou, 2013], and [Zhou, Padilla & Scott, 2015], as commented below, Lemma 2 and the collapsed Gibbs sampler described in Equations (11)-(14) appear to be novel. The SGGM (sum of generalized gamma processes) also seems interesting, given its superior performance in comparison to the GGM (generalized gamma process).

Section 2 on background information is too long for a conference paper. Space could be saved to expand the experimental results section, which does not seem to provide enough insight into why the GGM-gamma-Poisson process and SGGM-gamma-Poisson process outperform the simpler gamma-gamma-Poisson process. A possible explanation could be that while the number of topics of the gamma-gamma-Poisson process (gamma-negative binomial process) increases at a logarithmic rate, as shown in [Zhou, Padilla & Scott, 2015], the GGM and SGGM allow the number of inferred topics to increase at a power-law rate as a function of the number of documents.
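The logarithmic-versus-power-law contrast invoked in the last paragraph can be illustrated with its exchangeable-partition analogue: one- versus two-parameter CRP seating, where the discount $d$ switches the table count from $O(\log n)$ to $O(n^d)$ growth. A minimal simulation sketch (my own illustration, not from the paper; the CRM-specific rates would differ, but the growth regimes are the same):

```python
import random

def num_tables(n, alpha, d, rng):
    """Seat n customers by two-parameter CRP(alpha, d) seating and
    return the number of occupied tables (d = 0 gives the plain CRP)."""
    tables = []                     # tables[k] = customers at table k
    for i in range(n):
        r = rng.uniform(0.0, i + alpha)
        seated, acc = False, 0.0
        for k in range(len(tables)):
            acc += tables[k] - d    # existing table k has weight n_k - d
            if r <= acc:
                tables[k] += 1
                seated = True
                break
        if not seated:              # remaining mass alpha + d*K -> new table
            tables.append(1)
    return len(tables)

rng = random.Random(42)
n, runs = 3000, 5
avg_log = sum(num_tables(n, 1.0, 0.0, rng) for _ in range(runs)) / runs
avg_pow = sum(num_tables(n, 1.0, 0.5, rng) for _ in range(runs)) / runs
# avg_log grows like log(n); avg_pow grows like n**0.5 and is much larger
```

Plotting such table (topic) counts against $n$ for each prior, as suggested below, would make the claimed growth-rate explanation directly visible.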
If this is the case, the authors may want to plot the number of inferred topics of different algorithms under various settings. If the power-law behavior is not the only reason for the better performance, the authors may want to explain more about how and why the single additional parameter brought by the GGM (or the multiple additional parameters brought by the SGGM) helps to clearly improve the performance.

The discount parameter $d \in [0,1)$ is fixed for the GGM. It is unclear why a Metropolis-Hastings or greedy Gibbs algorithm to sample this univariate parameter has not been considered. If $d$ is inferred from the data, what would its posterior distribution look like?

The authors set $\rho'(dz) = z^{-1}e^{-z}dz$ to be the same for all object-specific CRMs $\Lambda_i$. It was shown in [Zhou and Carin, 2015] that choosing $\rho'_j(dz) = z^{-1}e^{-\frac{1-p_j}{p_j}z}dz$, with $p_j \sim \mathrm{Beta}(a_0, b_0)$, to be object specific leads to clearly improved results in comparison to fixing $p_j = 0.5$, i.e., letting $\rho'_j(dz) = z^{-1}e^{-z}dz$. It would be interesting to find out whether the GGP and SGGP would still outperform the gamma process under this simple modification.

Both Equations (5) and (6) need to be corrected, either by dividing their right-hand sides by $k!$ or by referring to their left-hand sides as collections of unordered count vectors. This is because the distribution of a collection of a random number of unordered count vectors and that of a random count matrix differ by the combinatorial coefficient $k!$. See Section 2.1.2 of [Zhou, Padilla & Scott, 2015] for the related combinatorial argument.

Comments about two related papers:

[Zhou, Padilla & Scott, 2015] Mingyuan Zhou, Oscar Hernan Madrid Padilla, and James G. Scott. "Priors for random count matrices derived from a family of negative binomial processes." To appear in Journal of the American Statistical Association, 2015.

[Zhou, 2013] Zhou, Mingyuan.
"Generalized negative binomial processes and the representation of cluster structures." arXiv preprint arXiv:1310.1800 (2013).

Theorem 2 is closely related to the derivation of the negative binomial process (gamma-Poisson process) random count matrix prior, as shown in Appendix A.1 of [Zhou, Padilla & Scott, 2015]. When $n=1$, Theorem 2 is the same as Theorem 3 of [Zhou, 2013] up to a combinatorial coefficient. Theorem 1 is closely related to Theorem 1 of [Zhou, 2013]. Corollary 1 and Proposition 1 are related to the discussion in [Zhou, Padilla & Scott, 2015] of how to generate a negative binomial process random count matrix: first generate a Poisson random number of nonzero columns, then generate i.i.d. column count vectors. Proposition 1 is also related to Theorem 2 of [Zhou, 2013]. Equations (7) and (8) are also closely related to the gamma-negative binomial process (gamma-gamma-Poisson process) described in detail in Appendix B.1 of [Zhou, Padilla & Scott, 2015].

Other minor issues:
- The authors comment in line 113 that "One should note that directly defining a prior on the power term in HPYP, gives absurd results." Is there any justification for this comment?
- Line 135: extra bullet.
- Line 264: "for with" => "for".
- Line 606: "0.2 to 0.7" => "0.3 to 0.7".
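Regarding the suggestion above to sample the discount parameter $d \in [0,1)$ rather than fix it: a random-walk Metropolis-Hastings update for a bounded univariate parameter is only a few lines. The sketch below (my own; all names are illustrative) targets a toy Beta(2,5)-shaped density standing in for the actual conditional of $d$, which would come from the paper's model:

```python
import math
import random

def log_target(d):
    """Toy stand-in for the (unnormalized) log conditional of the
    discount d: a Beta(2, 5)-shaped density supported on (0, 1)."""
    if not (0.0 < d < 1.0):
        return float("-inf")        # proposals outside [0, 1) are rejected
    return math.log(d) + 4.0 * math.log(1.0 - d)

def mh_step(d, step, rng):
    """One random-walk Metropolis-Hastings update for d."""
    prop = d + rng.gauss(0.0, step)
    log_alpha = log_target(prop) - log_target(d)
    if rng.random() < math.exp(min(0.0, log_alpha)):
        return prop
    return d

rng = random.Random(1)
d, samples = 0.3, []
for it in range(20000):
    d = mh_step(d, 0.1, rng)
    if it >= 5000:                  # discard burn-in
        samples.append(d)
mean_d = sum(samples) / len(samples)
# Beta(2, 5) has mean 2/7, roughly 0.286; mean_d should be close to that
```

Since the update only needs the conditional density of $d$ up to a constant, it could be dropped into the authors' Gibbs sweep as one extra step, which would also yield the posterior of $d$ asked about above.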
=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.): The paper studies properties of the distribution of a finite collection of Poisson processes with mean measures modelled through hierarchical completely random measures (HCRMs). In particular, the authors derive the marginal distribution of the Poisson processes by integrating out the HCRMs. This result is then used to develop a generalised Chinese restaurant franchise (CRF) scheme to simulate from the HCRM-Poisson process model. This general class of models can be used for topic modelling, providing a rich class of priors beyond the hierarchical Dirichlet process. Importantly, via the generalised CRF developed in the paper, inference is possible without having to resort to truncation. An example on the NIPS dataset shows the advantage of models beyond the HDP.
Clarity - Justification: The article is well written. There is a lot of notation, and it can be difficult to keep up at times, but it is necessary.
Significance - Justification: The generalised CRF developed here can be used for inference in models beyond the HDP in topic modelling, without resorting to truncation methods. The authors showed that models beyond the HDP can perform better (in terms of perplexity on a test set) on the NIPS dataset. The authors suggest that the results could also be applicable to hidden Markov models.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): The paper is well written and makes a novel and interesting contribution to the literature. It would be nice to have motivations for the more complex priors over the HDP based on properties of the a priori feature allocation structure (e.g., power-law behaviour), rather than just empirical results. In addition, I think the authors should acknowledge any limitations of the model; e.g., does it include the hierarchical PY, which cannot be represented as a normalised CRM?

Other comments:
1. line 114: can you provide a reference for this note?
2. line 135: empty bullet.
3. line 169: with probability one.
4. line 418: introduce the notation CCRM.
5. line 539: define $T^{-(il)}$.
6. References: check capitalisation in titles and use "Kingman, John FC" in both references.
=====