Paper ID: 825 Title: Hierarchical Compound Poisson Factorization ===== Review #1 ===== Summary of the paper (Summarize the main claims/contributions of the paper.): The paper introduces a variant of Hierarchical Poisson Factorization (HPF) for non-negative matrix factorization that decouples the sparsity of the observed matrix from the model of the distribution of the observed entries (the response model). Sections 2 and 3 introduce background on compound Poisson distributions. Section 4 introduces the Hierarchical Compound Poisson Factorization (HCPF) model as well as a variational inference algorithm; HCPF is shown to subsume the HPF model, which appears as a degenerate case, at the end of Section 4. Section 5 illustrates the benefits of using HCPF over HPF in a few experiments on real benchmark data sets. Here, variants of HCPF with different response models are compared against the vanilla HPF method. Clarity - Justification: The paper is reasonably clear. The contribution is the model and, to some extent, the algorithm (although it is simple). The experiments look promising and seem to be reproducible. Significance - Justification: The significance of the contribution is hard to evaluate: investigating the algorithm from a theoretical point of view, or providing further comparisons with the literature, seems to be required to assess this point. Even though the paper lacks theoretical guarantees, the proposed method looks promising and deserves attention. Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): The paper provides an algorithmic contribution, with illustrative experiments on a few data sets. The experiments look promising. The experimental setup looks clean enough, and the specifics of the data are provided in a clear way. "We sampled an equal number...": do you mean an equal percentage? This is slightly confusing.
The name of {\cal L}_{NM} refers to non-missing entries, but the definition seems to also consider missing entries. When you sample "from the full matrix", how do you handle the missing entries? This is a little confusing to me. The contribution of the paper would have been stronger with a theoretical study showing the benefit of the approach in terms of factorization accuracy. Decision: borderline, tending to accept. ===== Review #2 ===== Summary of the paper (Summarize the main claims/contributions of the paper.): This paper introduces a non-negative matrix factorization framework based on the compound Poisson likelihood. The framework can flexibly model a large number of data types, including real, positive, discrete, and zero-truncated discrete, in a principled way by specifying different compound element distributions. A stochastic variational inference algorithm that allows for efficient parameter estimation on large datasets is provided for all of the different variants. A useful property of exponential dispersion models (EDMs) that facilitates a general inference scheme (Theorem 2) is proven. Another property of compound Poisson random variables (Theorem 3) is proven and used to claim deficiencies in an existing baseline model, hierarchical Poisson factorization (HPF).
All seven variants of the framework, along with HPF, are fit to nine datasets; at least one of the variants assigns a higher log-likelihood to held-out data than HPF for eight of the nine. A subset of the variants is then fit to three more non-discrete datasets to demonstrate the flexibility of the modeling framework. Clarity - Justification: This paper is very well-written and a pleasure to read. Significance - Justification: The proposed model family is fascinating, beautifully presented, and indeed very flexible. This model family can express everything in Table 1 using the very simple and scalable algorithm given in Algorithm 1—that is a great contribution. Furthermore, the theoretical properties presented about EDMs and compound Poisson random variables are interesting and widely applicable. Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): I think this paper relies on sleight-of-hand to exaggerate the deficiencies of HPF, which the current narrative uses as its main motivation for the proposed model framework. Here's where I think the sleight-of-hand is happening: Figure 1(a) shows the PMF of a zero-truncated Poisson variable "at various levels of sparsity". If I understand this correctly, it says that if we fit a single Poisson distribution with rate \lambda to a bunch of counts, 99% of which are zero, the PMF of the Poisson distribution will naturally concentrate its mass towards zero. If we then zero-truncate this Poisson—i.e., look at where the PMF concentrates its mass, other than zero—the mass concentrates at 1. This intuition is then used to suggest that HPF, which assumes each observed count is a Poisson random variable, will learn degenerate point masses at 1 for counts it knows to be non-zero. This would be true if HPF assumed that every observed count was i.i.d. from the same Poisson and would thus learn a single \lambda with the distribution illustrated in Figure 1(a).
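The single-rate premise behind Figure 1(a) is easy to reproduce numerically. The sketch below (mine, not from the paper) computes the zero-truncated Poisson PMF at a small rate, of the kind a single Poisson fit to 99%-sparse counts would yield, and confirms that essentially all of the conditional mass sits at 1:

```python
import math

def poisson_pmf(k, lam):
    """PMF of a Poisson(lam) random variable at k."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def zt_poisson_pmf(k, lam):
    """Zero-truncated Poisson: the Poisson PMF conditioned on k >= 1."""
    if k < 1:
        return 0.0
    return poisson_pmf(k, lam) / (1.0 - math.exp(-lam))

# The MLE of a single Poisson rate is the sample mean, so a matrix that
# is 99% zeros (with small non-zero counts) yields a small lam, e.g.:
lam = 0.05
print(f"P(X = 1 | X >= 1) = {zt_poisson_pmf(1, lam):.3f}")  # ≈ 0.975
```

This only demonstrates the premise of the figure (a single shared rate); the objection below is precisely that HPF does not learn one shared rate.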
However, like all factorization models, HPF assumes that each count is an independent (but not identically distributed) Poisson random variable. Thus, the level of sparsity of the whole dataset is not an indicator that all the \lambdas learned by HPF will be near zero. Many will be significantly greater than zero; this is why Poisson factorization is actually quite successful at modeling highly overdispersed counts in the presence of high sparsity. The empirical results are presented as evidence of what I believe to be misleading claims about HPF. The paper pits seven variants of the framework against one baseline model. On seven different datasets, one (or more) of the framework variants beats the baseline, and on one dataset the baseline wins. This does not completely rule out the null hypothesis that all eight models are essentially the same and that the one baseline "won" 1/8th of the time, randomly. The magnitudes of the log-likelihoods appear to tell a different story; but this paper would be much stronger if it obtained these strong results against more than one baseline. If I believed the intuitive explanation for why HPF is a weaker model, I probably wouldn't emphasize this weakness of the paper. This paper is missing discussion of two different threads of relevant prior work. First, compound Poisson factorization models have been explored by Yilmaz and Cemgil (2012) [1] and Simsekli et al. (2013) [2]. This work includes similar discussions of exponential dispersion models, their relevance to sparse data, and variational inference for fitting compound Poisson factorization models. Second, there are many baseline approaches to dealing with missing data in collaborative filtering. Simple down-weighting and down-sampling approaches to missing entries are given by Hu et al. (2008) [3] and Rendle et al. (2009) [4].
Many more sophisticated approaches have since been given, including a very recent one that explicitly decouples the observation and response models within a Poisson factorization framework (Liang et al. 2016 [5]). The latter reference is highly related but recent enough that its absence from the references should not be counted against this paper. However, none of this prior work on handling missing entries in collaborative filtering is discussed or compared against in the experiments. To sum up, this paper puts forward a really nice model family that I believe has wide applicability. It falls short in the empirical results section, where it compares to only one baseline model that I believe the paper frames in a misleading way. It is also missing discussion of relevant prior work. [1] Y. Yilmaz and A. Taylan Cemgil. Alpha/beta divergences and Tweedie models. arXiv preprint arXiv:1209.4280, 2012. [2] U. Simsekli, A. Taylan Cemgil, and Y. Yilmaz. Learning the beta-divergence in Tweedie compound Poisson matrix factorization models. In Proceedings of ICML, 2013. [3] Y. Hu, Y. Koren, and C. Volinsky. Collaborative filtering for implicit feedback datasets. In Proceedings of the Eighth IEEE International Conference on Data Mining (ICDM '08), pages 263–272. IEEE, 2008. [4] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 452–461, 2009. [5] D. Liang, L. Charlin, J. McInerney, and D. Blei. Modeling user exposure in recommendation. In Proceedings of WWW, 2016. ===== Review #3 ===== Summary of the paper (Summarize the main claims/contributions of the paper.): This paper presents an extension of hierarchical Poisson factorization where the rating is sampled from a compound Poisson distribution (rather than a Poisson distribution). Variational inference has been developed for the hierarchical compound Poisson factorization.
Clarity - Justification: It seems interesting to incorporate the compound Poisson distribution into the factorization for recommendation with missing data. However, it is not clear how exactly the compound Poisson distribution is related to the response model. For instance, in X+ = X_1 + ... + X_N, what does X_i correspond to in the current problem? Significance - Justification: In a nutshell, the main difference between HPF and the current method is in the response model. In the former, the rating is sampled from a Poisson distribution; in the latter, the rating is sampled from an exponential dispersion model whose parameter is drawn from a Poisson distribution. I think this is the key idea to decouple the sparsity model and the response model. Is this really useful for handling nonrandom missing data? Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): First of all, I would like to say that the presentation should be improved, since I had to look at the paper on HPF to follow what the current paper is presenting. It is not clear why the current method is a good candidate for the nonrandom-missing-data setting. For nonrandom missing data, there are a few existing works, most of which consider a missing mechanism combined with a response model. One recent example is a binomial mixture model for nonrandom missing data. Since the paper is about a factorization method for recommendation systems, the idea of using the compound Poisson distribution should be well blended with the recommendation problem, with a more generous and clearer description. =====
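Review #3's question about what the X_i are in X+ = X_1 + ... + X_N can be made concrete with a small simulation of the generative decoupling: N ~ Poisson(\lambda) controls sparsity (N = 0 yields an exact zero), while the element distribution of the X_i is the response model. This is a hedged sketch only; the Gamma element distribution below is an illustrative assumption, not the paper's specific choice.

```python
import math
import random

def sample_poisson(lam, rng=random):
    """Knuth's multiplicative method for sampling N ~ Poisson(lam)."""
    threshold = math.exp(-lam)
    n, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return n
        n += 1

def compound_poisson_sample(lam, element_sampler, rng=random):
    """Draw X+ = X_1 + ... + X_N with N ~ Poisson(lam) and the X_i
    i.i.d. from element_sampler (the response model). N = 0 yields an
    exact zero, so sparsity (via lam) is decoupled from the
    distribution of the non-zero responses."""
    n = sample_poisson(lam, rng)
    return sum(element_sampler() for _ in range(n))

random.seed(0)
# Gamma elements are an illustrative choice (a Tweedie-like positive
# response); the framework allows other EDM element distributions.
draws = [compound_poisson_sample(0.05, lambda: random.gammavariate(2.0, 1.0))
         for _ in range(10000)]
frac_zero = sum(d == 0 for d in draws) / len(draws)
print(f"fraction of exact zeros: {frac_zero:.3f}")  # ≈ exp(-0.05) ≈ 0.95
```

In this reading, each X_i is one latent "event" contributed to an entry, and the observed rating is their sum; the zero-vs-nonzero decision depends only on the Poisson rate, which is the decoupling the significance justification describes.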