We thank all the reviewers for their time and thoughtful comments!

Reviewer 1

Thank you for your positive feedback on our paper!

1. We can explicitly convert between moments and cumulants using Faà di Bruno's formula. We use k-statistics (which, exactly as you said, involve first estimating the moments) to estimate cumulants from samples (a short sketch of these estimators is included at the end of this response). This is currently buried in the Appendix (1255), and we will clarify it in the main text in the revision.

2. You are correct that the specific case of Sec. 3.1 can work with 3rd-order cumulants. We chose to present the result with 4th-order cumulants because many natural distributions (e.g., symmetric distributions) have a zero 3rd-order cumulant, which is not informative, whereas most natural distributions have nonzero 4th-order cumulants and satisfy the requirements of the algorithm. We will clarify this in the text.

Reviewer 2

Thank you for your helpful feedback. We address each of your clarification questions in detail below and will revise the paper accordingly.

1. In the sentence you highlighted we are referring to the more difficult version of *overcomplete* ICA, where the number of observations is *smaller* than the number of sources. You are correct that standard ICA (where the number of observations equals the number of sources) is identifiable, but the overcomplete version is not. Even our simple model has 3 sources and only 2 observations, and so it is more similar to overcomplete ICA. We will clarify this.

2. What we meant in this statement is that CCA recovers the subspace of the correlation between U and V, which in this case is the subspace spanned by S2. However, it does not directly give us a way to learn S1 (or S3) in general. One approach based on CCA that we might try is to project the samples of U onto the directions orthogonal to the CCA subspace, in the hope that this "removes" the influence of S2 on these samples. But this would not work except in very special settings, because the projection distorts the distribution of S1. We apologize that this statement was unclear, and we will flesh it out in the revision.

3. Good idea. We will add more details on Chebyshev polynomials in the revision.

4. In contrastive regression, Y is directly observed, along with U and V, where U = S1 + S2 and V = S2 + S3. Moreover, we assume Y = \beta^T S1 + independent Gaussian noise, hence E[YU] = E[Y S1]. If we had S1 directly, this would be the standard regression setting and we could find \beta via least squares. But here we do not have samples of S1, only of U and V. Our method still recovers \beta by giving an unbiased estimate of E[S1 S1^T]; a small numerical sketch of this identity is included below.
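To make point 4 concrete, here is a minimal numerical sketch. It is only an illustration under simplifying assumptions (S1, S2, S3 mutually independent and zero-mean, plain plug-in second-moment estimates, our own synthetic distributions and variable names); it is not necessarily the exact estimator in the paper, but it shows why \beta is identifiable from (U, V, Y) alone.

```python
# Illustrative sketch (our simplifying assumptions, not the paper's exact estimator):
# with S1, S2, S3 mutually independent and zero-mean,
#   E[U U^T] = E[S1 S1^T] + E[S2 S2^T],   E[U V^T] = E[S2 S2^T],
# so E[S1 S1^T] = E[U U^T] - E[U V^T], and E[Y U] = E[S1 S1^T] beta.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200_000, 3
beta = np.array([1.0, -2.0, 0.5])

# Latent sources (zero-mean) and the two observed views.
S1 = rng.laplace(size=(n, d))
S2 = rng.standard_normal((n, d)) @ np.diag([2.0, 1.0, 0.5])
S3 = rng.uniform(-1.0, 1.0, size=(n, d))
U, V = S1 + S2, S2 + S3
Y = S1 @ beta + 0.1 * rng.standard_normal(n)   # noise independent of the sources

# Plug-in moment estimates computed from (U, V, Y) samples only.
E_UU = U.T @ U / n        # estimates E[S1 S1^T] + E[S2 S2^T]
E_UV = U.T @ V / n        # estimates E[S2 S2^T]
E_S1S1 = E_UU - E_UV      # unbiased estimate of E[S1 S1^T]
E_YU = U.T @ Y / n        # estimates E[Y S1] = E[S1 S1^T] beta

beta_hat = np.linalg.solve(E_S1S1, E_YU)
print(beta_hat)           # close to [1.0, -2.0, 0.5] even though S1 is never observed
```

With enough samples, beta_hat is close to the true \beta even though S1 is never observed directly; the general setting in the paper handles the cases where higher-order cumulants are needed.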
5. The joint model, even though it matches the data generation, is quite complex to do inference over. It is significantly more complicated than the standard Gaussian mixture model, where EM works well. One reason is that the samples of U and V have complex correlations: their joint distribution is itself a mixture of Gaussians. RCA is guaranteed to recover the correct parameters, while EM can get trapped in local minima (which we do see experimentally). It is possible that EM with better initialization (possibly using RCA!) could get a better result, and this would be an interesting future direction.

6. We used data from actual DNA methylation biomarkers to capture the complexities of empirical distributions. To ensure that we have ground truth to evaluate the method, we generated synthetic samples from this data.

7. We compare to CCA because it is a general-purpose tool that is most relevant to our setting. Other tools are more specific, and we are not aware of any that could work for all the different settings in our experiments. In special cases, like the contrastive mixture example, we can write down a joint model, and there we compared RCA with EM.

Reviewer 3

Thank you for your helpful feedback and your support for our paper!

[a] This is a good idea, and we will add a section with practical guidelines for using RCA. In our experiments, it was sufficient to use relatively low-order approximations, so we only needed to compute up to 3rd- and 4th-order moments, which allowed the method to be quite scalable (see the short sketch of the cumulant estimators at the end of this response).

[b] Good observation. Yes, the concept of L-distinguishability is similar to the anchor words assumption, but L-distinguishability is more general in two ways. First, if anchor components exist then the system is 1-distinguishable, but the converse is not true. For example, {1}, {1,2}, {1,2,3} is not separable but is 1-distinguishable. Second, when L is at least 2, it is more like we have "anchor sets" rather than "anchor words".

[c] Our sample complexity analyses mostly use standard tools from matrix perturbation analysis. We will say a bit more about the analysis in the main text.
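For reference (Reviewer 1, point 1, and Reviewer 3, point [a]): below is a minimal, self-contained sketch of the k-statistics used to estimate low-order cumulants from samples. These are the standard Fisher k-statistics for orders 2-4, stated here only as an illustration (the choice of test distribution and the scipy cross-check are ours, not from the paper).

```python
# Standard unbiased k-statistics for the 2nd-4th cumulants,
# computed from the central sample moments m_r = mean((x - mean(x))**r).
import numpy as np
from scipy import stats  # used only to cross-check the closed-form expressions

def k_statistics(x):
    """Return unbiased estimates (k2, k3, k4) of the 2nd-4th cumulants of x."""
    x = np.asarray(x, dtype=float)
    n = x.size
    c = x - x.mean()
    m2, m3, m4 = (c**2).mean(), (c**3).mean(), (c**4).mean()
    k2 = n / (n - 1) * m2
    k3 = n**2 / ((n - 1) * (n - 2)) * m3
    k4 = n**2 * ((n + 1) * m4 - 3 * (n - 1) * m2**2) / ((n - 1) * (n - 2) * (n - 3))
    return k2, k3, k4

rng = np.random.default_rng(0)
x = rng.laplace(size=100_000)  # Laplace(0,1): kappa2 = 2, kappa3 = 0, kappa4 = 12
print(k_statistics(x))
print([stats.kstat(x, r) for r in (2, 3, 4)])  # should match the formulas above
```

Only estimates up to the 4th order are needed in our experiments, which is what keeps the method scalable in practice.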