We thank all the reviewers for their valuable comments and concerns. Here we address some of those concerns.

All Reviewers: Lack of Comparison.
First, we would like to note that in (Gopalan, 2014) and (Gopalan, 2015), the authors showed that HPF significantly outperforms PMF (Salakhutdinov and Mnih, 2008), LDA (Blei, 2003), and NMF (Lee and Seung, 1999) on the Netflix and Echonest data sets on the recommendation task. We do not repeat those experiments because of redundancy and computational burden. We instead compare HCPF directly with HPF, which dominates the related methods. We will make this point explicit in the text.

Response for Reviewer_1: Intuition Behind X_i and Utility of Explicit Modeling of Missing Data.
X_i can be thought of as the atomic contribution of each factor. In HCPF, X_i is a random variable whose distribution is an additive EDM, whereas in HPF it is a point mass at 1. This difference allows HCPF to handle different types of data. Reviewer_2 notes that a flexible X_i allows for more flexible probabilistic response models. More importantly, compounding eases the tight coupling of the response and sparsity models (Theorem 3); see the sketch following this response. The decoupling hypothesis is borne out in the results, where we see a significant improvement in test likelihood across our twelve data sets with HCPF (Tables 3 and 4). In Table 5, we obtain an improvement in AUC over HPF using HCPF with the appropriate choice of the element distribution (see also http://i.imgur.com/bMwOSuJ.png for RMSE comparisons, which we will add to the manuscript).
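To make the compounding construction concrete, here is a minimal generative sketch in Python. The gamma element distribution and its shape/scale values are hypothetical choices for illustration only, not settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hpf_response(rate):
    # HPF: each atomic contribution X_i is a point mass at 1, so the
    # observed response equals the Poisson count itself and a single
    # rate governs both sparsity and response magnitude.
    return rng.poisson(rate)

def sample_hcpf_response(rate, shape=2.0, scale=1.5):
    # HCPF: Y = X_1 + ... + X_N with N ~ Poisson(rate) and X_i drawn
    # i.i.d. from an additive EDM element distribution. A gamma element
    # distribution is used here purely for illustration; the shape and
    # scale values are hypothetical.
    n = rng.poisson(rate)
    return rng.gamma(shape, scale, size=n).sum()  # 0.0 when n == 0

# In both models P[Y = 0] = exp(-rate), so the Poisson rate sets the
# sparsity level; in HCPF the element distribution independently sets
# the magnitude of the non-zero responses -- the decoupling discussed
# in the response above.
rate = 0.1  # an extremely sparse regime
print(sample_hpf_response(rate), sample_hcpf_response(rate))
```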
Response for Reviewer_2: Prior and Posterior Sparsity Estimates.
The reviewer makes an excellent point regarding Figure 1. Here, we elaborate on this issue and provide further evidence for the tight coupling in HPF. The red lines represent our prior knowledge of sparsity in the data sets (i.e., the observed average sparsity level). Posterior sparsity estimates for each cell, however, will differ. In fact, if the data are not missing at random, we should obtain significantly different sparsity estimates for missing and non-missing entries. In the Movielens data, HPF trained on the full matrix has average expected posterior sparsity estimates of 0.94 and 0.47 for the missing and non-missing test entries, respectively. Despite being different from the empirical prior estimate, the sparsity level for the non-missing test entries is high and, from Theorem 3, we expect a tight coupling between the sparsity and response models. HPF has an average posterior expected response value of 1.89 for the non-missing entries, compared with the true average response value of 3.58 and the HCPF-PO average expected response value of 4.22. RMSE was 2.21 for HPF and 1.53 for HCPF-PO. Additional evidence is in the non-missing test log likelihoods in Table 4 (-4.808 for HPF and -1.756 for HCPF-PO). We see a similar pattern in the other data sets. The only exceptions are Merck and Bestbuy, although we emphasize the near-binary characteristics of those data sets in the text. In short, the posterior estimates for the response and sparsity models confirm our hypothesis that HPF suffers from a detrimental tight coupling between the sparsity and response models on extremely sparse data sets.

In terms of prior work, we briefly mentioned the beta divergence approach to NMF in the introduction (Fevotte & Idier, 2011). However, (Yilmaz, 2012) and (Simsekli, 2013) are certainly more relevant, as they present the connection between the beta divergence and the compound Poisson-gamma distribution. We will include a discussion of both. In terms of a comparison, however, the algorithm presented in (Simsekli, 2013) does not use a stochastic optimization routine and will not scale to the data sets we consider. We expect its performance to be similar to HCPF_GAMMA.

Response for Reviewer_3: Theoretical Investigations.
To fit HCPF, we use stochastic variational inference (SVI). The convergence guarantees and the choice of learning rates are analyzed theoretically in (Hoffman, 2013): with learning rates satisfying the Robbins-Monro conditions, SVI converges to a local optimum of the evidence lower bound on the log likelihood. The runtime of SVI on HCPF is comparable to that on HPF, especially for discrete data.

Sampling Missing Entries.
Sampling an equal percentage of missing test entries is not computationally feasible: in the Netflix data, for example, 20% of the missing entries amounts to roughly 1.7 billion entries. Instead, we subsample the same number of missing entries as we have non-missing entries for testing. When training on the full matrix, we randomly choose an entry and, when it is not in the non-missing training set, the non-missing test set, or the missing test set, we conclude that it is a missing training entry. It is not feasible to store the indices of every missing entry of the sparse matrix.
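A minimal sketch of this rejection-sampling procedure in Python follows; the function and argument names are ours, chosen for illustration.

```python
import numpy as np

def sample_missing_training_entries(n_users, n_items, n_samples,
                                    train_nz, test_nz, test_missing,
                                    seed=0):
    """Rejection-sample missing training entries without materializing
    the set of all missing indices. train_nz, test_nz, and test_missing
    are sets of (user, item) pairs; the names are hypothetical."""
    rng = np.random.default_rng(seed)
    excluded = train_nz | test_nz | test_missing
    samples = []
    while len(samples) < n_samples:
        # Draw a uniformly random cell of the matrix.
        u, i = int(rng.integers(n_users)), int(rng.integers(n_items))
        # Any cell outside the non-missing training set, the non-missing
        # test set, and the missing test set is a missing training entry.
        if (u, i) not in excluded:
            samples.append((u, i))
    return samples
```

Because the observed entries are a tiny fraction of the matrix in these data sets, rejections are rare and the loop terminates quickly without ever enumerating the missing entries.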