Common concerns
-----
The experimental section of this paper is indeed short. However, the intention of the paper was not to present a better-performing mixed-membership model, but to provide a theoretical framework for modelling with hierarchical CRMs. Different choices of the base and object-specific CRMs result in different sampling equations, and thereby different perplexity results. We intend to perform a thorough study of how the choice of Poisson intensity measure affects performance in future work.
We tested with a uniform prior on the discount parameter but found the resultant model to be quite unstable. We will include results with non-uniform beta priors on the discount parameter in the final draft.
As suggested, we will add results for $p_j \neq 0.5$ in the final draft. We do not yet know how the resultant model will compare against the models we have experimented with.
The minor corrections pointed out by the reviewers will be incorporated in the final draft.
Assigned Reviewer 1
-----
Thanks for the detailed positive review.
As suggested, the sampling schemes for various CRMs will be added in the final draft. The dish-sampling equations in the CRF scheme for the SGGP (and hence the GGP, which is a special case) are already provided in the experiments. Except for the case of sampling a new table, the dish- and table-sampling equations for the gamma-gamma CRM are exactly the same as for the HDP.
GGMs and SGGMs can model power-law growth in the number of topics. The improved performance of these models suggests that the number of topics in the NIPS corpus follows a power law. Indeed, the number of topics discovered by the GGM was as high as 130, compared to only 45-50 topics discovered by the gamma process. A graph of this quantity with respect to the training percentage, for various choices of the discount parameter, will be added in the final draft.
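For illustration (a hypothetical simulation, not part of the paper or its experiments), the contrast between power-law and logarithmic growth can be seen in the two-parameter (Pitman-Yor) Chinese restaurant process, whose discount parameter `d` plays a role analogous to the discount of the SGGP: with `d > 0` the number of occupied tables grows like $n^d$, while with `d = 0` (the Dirichlet/gamma-process case) it grows only logarithmically.

```python
import random

def crp_num_clusters(n, alpha, d, seed=0):
    """Seat n customers in a two-parameter CRP; return the number of tables."""
    rng = random.Random(seed)
    counts = []  # customers at each occupied table
    for i in range(n):
        # A new table is opened with probability (alpha + d * #tables) / (alpha + i).
        if rng.random() < (alpha + d * len(counts)) / (alpha + i):
            counts.append(1)
        else:
            # Otherwise join table j with probability proportional to counts[j] - d.
            r = rng.random() * (i - d * len(counts))
            acc = 0.0
            for j, c in enumerate(counts):
                acc += c - d
                if r < acc:
                    counts[j] += 1
                    break
            else:
                counts[-1] += 1  # guard against floating-point slack
    return len(counts)

k_py = crp_num_clusters(2000, alpha=1.0, d=0.5)  # power-law growth, roughly O(n^0.5)
k_dp = crp_num_clusters(2000, alpha=1.0, d=0.0)  # Dirichlet-process case, roughly O(log n)
print(k_py, k_dp)  # the d=0.5 count is far larger than the d=0 count
```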
Thanks for pointing out the missing $k!$ term.
Thanks for pointing out the related works. We weren't aware of the JASA submission. The equations match up to a combinatorial constant in the case of the gamma-Poisson process. We will add citations to the mentioned papers at the appropriate places.
Assigned Reviewer 2
----
Thanks for the review.
A Poisson process with finite mean measure $\mu$ can be sampled by first drawing the number of (possibly non-distinct) points, say $n$, from a Poisson distribution with mean $\mu(S)$, and then sampling the $n$ points independently from the normalized measure $\mu/\mu(S)$. Lines 842-844 consist of a product of products: the first product corresponds to the fact that there are $n$ conditionally independent Poisson processes, while the second corresponds to the fact that the points of a Poisson process are independent conditioned on the number of points in the process.
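The two-step construction described above can be sketched as follows (a minimal illustration, not code from the paper; the function name and the uniform example measure are our own):

```python
import numpy as np

def sample_poisson_process(total_mass, sample_normalized, rng):
    """Sample the points of a Poisson process with finite mean measure mu.
    total_mass        -- mu(S), the total mass of the mean measure
    sample_normalized -- draws one point from the normalized measure mu/mu(S)
    """
    # Step 1: number of points N ~ Poisson(mu(S)).
    n = rng.poisson(total_mass)
    # Step 2: N i.i.d. points from the normalized measure.
    return [sample_normalized(rng) for _ in range(n)]

rng = np.random.default_rng(0)
# Example: mean measure 5 * Uniform[0, 1] on S = [0, 1].
points = sample_poisson_process(5.0, lambda r: r.uniform(0.0, 1.0), rng)
```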
Summing the probabilities to obtain the marginal distribution of counts is a standard way of deriving the EPPF for normalized nonparametric measures. The proof is quite similar to the one in Section 9.5 of Kingman's Poisson Processes.
Since the number of words in each document is fixed, the CRF sampling scheme for the HDP is identical to the gamma-gamma-Poisson sampling scheme if we place a gamma prior on the $\alpha$ parameter of the HDP (see Section 3.1 of Augment-and-Conquer NBP by Zhou et al.). In general, however, modelling with normalized versions of arbitrary CRMs is quite challenging. Only by assuming a CRM-specific prior on the number of points in each object (for instance, negative binomial for gamma-Poisson) can we obtain the form derived in Theorem 2.
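As a quick sanity check of the gamma-Poisson case mentioned above (a hypothetical simulation, not from the paper; the shape and rate values are arbitrary): mixing a Poisson count over a gamma-distributed rate yields a negative binomial marginal, which is exactly the CRM-specific prior on the number of points referred to here.

```python
import numpy as np

rng = np.random.default_rng(1)
shape, rate = 3.0, 2.0  # gamma prior on the Poisson rate (arbitrary values)

# Draw counts by first sampling a rate, then a Poisson count with that rate.
rates = rng.gamma(shape, 1.0 / rate, size=200_000)
mixed = rng.poisson(rates)

# The marginal should be NegBin(r = shape, p = rate / (1 + rate)); compare moments.
r, p = shape, rate / (1.0 + rate)
nb = rng.negative_binomial(r, p, size=200_000)
print(mixed.mean(), nb.mean())  # both close to r * (1 - p) / p = 1.5
```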
In (Asuncion et al.), 94% of the documents in the NIPS corpus were used for training; our training fraction varied from 30% to 70%. Moreover, our dataset had 13,649 unique words, as opposed to 12,419 in the mentioned paper. Although we did not perform a thorough hyperparameter search, the SGGP achieved a perplexity of ~1510 on our dataset using exactly the same settings as in (Asuncion et al.). Hyperparameter learning should improve these results, and we will add them to the final draft.
Lastly, we would like to stress that there is no doubt about the correctness of the derived sampling schemes. As Reviewer 1 noted, the theorems mimic the results derived for specific choices of CRMs, such as gamma-Poisson.
Assigned Reviewer 3
------
Thanks for the positive review.
In order to incorporate the unnormalized Pitman-Yor process, one needs CRMs whose intensity measure is itself random, that is, $\nu(dx, dz) = L\rho(dz)\mu(dx)$, where $L$ is a random variable. When $\rho$ is the intensity measure of the GGP and $L$ is gamma distributed, the corresponding normalized CRM is the Pitman-Yor process. While it is straightforward to derive sampling results for such models, the resulting notation becomes too cumbersome, so we removed the section on such models from our draft.