Paper ID: 717
Title: Rich Component Analysis

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper describes a cumulant-based approach to estimating the cumulants of S_1 in a setting where two or more views of the form U = S_1 + S_2 and V = A S_2 + S_3 are observed. In comparisons against ground truth (where S_1 is known) and against CCA, the proposed method RCA works very well.

Clarity - Justification:
A number of points are listed below to illustrate parts of the paper that are not so clear.

Significance - Justification:
If the method is as good as the simulations seem to indicate, it could be of major practical interest. See also the comments below.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper claims to offer a major methodological breakthrough, but the details listed below leave this reviewer somewhat doubtful about its significance. It would be a big shame to reject an important contribution, but it is the authors' responsibility to write a convincing story.

Major comments:
1. "It is information-theoretically not possible to deconvolve and recover the individual samples si. Instead we aim to learn the distribution Si(θ) without having explicit samples from it." Clarify this statement. ICA models are identifiable (i.e., by assuming non-Gaussian components), so that sources may be recovered up to permutation, scale, and sign symmetries.
2. "Moreover, using CCA, it is not clear how to learn the distribution S1 if it is not orthogonal to the shared subspace S2." Clarify "orthogonal". CCA assumes the S components to be independent but places no restrictions on the matrices (A and B in the paper).
3. Give more detail about the Chebyshev polynomial approximation of the logistic function: how it is derived and the quality of the approximation. (One standard construction is sketched at the end of this review.)
4. Contrastive regression example. Give more details. It is not clear whether Y is observed, nor how E[Y S_1] is obtained.
5. "To test the performance of a joint model, we consider a friendly setting where we are given the exact parametric..." It is hard to believe that RCA performs better than a method whose model matches the data-generating process. The justification for this result is quite weak and calls for more explanation and experiments. For example, one would then expect that in a model with fewer parameters, maximum likelihood should do better than RCA.
6. The "real" experiment is a simulation in which real data is used.
7. CCA is clearly used as a straw man here. Less obvious straw men could be of interest.

Minor comments:
1. model of interest, S1 -> component of interest, S1
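Sketch referenced in major comment 3 — one standard way to build such an approximation (purely illustrative: the degree and interval below are arbitrary choices and need not match what the paper does): sample the logistic function at Chebyshev nodes on a bounded interval and fit a Chebyshev series.

    import numpy as np
    from numpy.polynomial import chebyshev as C

    def logistic(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Fit a degree-7 Chebyshev series to the logistic function on [-5, 5]
    # by sampling it at Chebyshev nodes mapped into that interval.
    deg, a, b = 7, -5.0, 5.0
    nodes = np.cos(np.pi * (np.arange(deg + 1) + 0.5) / (deg + 1))  # nodes in [-1, 1]
    x = 0.5 * (b - a) * nodes + 0.5 * (b + a)                       # mapped to [a, b]
    coeffs = C.chebfit(nodes, logistic(x), deg)

    # Evaluate the approximation on a fine grid and report the worst-case error.
    grid = np.linspace(a, b, 1001)
    approx = C.chebval((2 * grid - (a + b)) / (b - a), coeffs)
    print("max abs error on [-5, 5]:", np.max(np.abs(approx - logistic(grid))))

The quality of the approximation (and hence the order of the cumulants needed downstream) depends on the chosen degree and interval, which is exactly the detail the comment asks the authors to spell out.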
=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents a method for learning partial latent representations from multi-view data, called Rich Component Analysis (RCA). The main idea behind the method is to exploit linear relations between the cumulants of the observed variables and the cumulants of the latent variables. The proposed algorithms recover the latter by plugging empirical versions of the former into these relations and solving the corresponding equations. Once the cumulants of the latent variables of interest are (approximately) known, they can be used to learn parameters for the distribution of these variables under multiple parametric representations (e.g., PCA, linear and logistic regression, mixtures of Gaussians, etc.).

Clarity - Justification:
The paper does a very good job of presenting the main ideas behind the method.

Significance - Justification:
The experimental evaluation showcases the flexibility and statistical advantages of RCA over CCA and another naive baseline.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
[a] The paper is very convincing about the statistical advantages of RCA, but it does not delve much into the algorithmic aspects of the method (with the exception of the comment starting at Line 324). Formally, RCA bears a vague resemblance to the method of moments, which in some cases is preferred over EM for reasons of speed (among others). My impression is that here the story is a bit more subtle, because the model assumed for S1 determines the order of the cumulants that need to be recovered, and that order affects both the complexity of recovering the cumulant itself and of extracting the moments of interest from it. Thus, I think the paper should include a short discussion of the running time of the experiments and an idea of which combinations of t and d are tractable in practice. (A second-order sketch of the kind of cumulant cancellation involved is given at the end of this review.)
[b] The discussion in Section 3.2 on how to use RCA beyond the "simple" contrastive learning problem is interesting, but it does not look very practical at the moment. Are there any known problems that could benefit from this general approach? Also, it feels like the L-distinguishability condition is somehow related to the anchor-word assumption used, e.g., in topic modelling. Could you please comment?
[c] It is reassuring that the supplementary material contains an analysis of the sample complexity of RCA. But do these results follow from somewhat standard arguments, or is there a major contribution in the proofs worth mentioning in the main body of the paper?
[Line 213] t or n variables?
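Sketch referenced in comment [a] — the second-order (covariance) case only, assuming the two-view model U = S_1 + S_2, V = A S_2 + S_3 with mutually independent components and, as a simplification of the paper's setting, a known mixing matrix A. It is meant only to show the flavor of the cancellation, not the paper's actual algorithm.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 5, 200000

    # Simulate the two-view model U = S1 + S2, V = A S2 + S3 (independent parts).
    A = rng.normal(size=(d, d))
    S1 = rng.laplace(size=(n, d))   # component of interest
    S2 = rng.normal(size=(n, d))    # shared component
    S3 = rng.normal(size=(n, d))    # view-specific component
    U = S1 + S2
    V = S2 @ A.T + S3

    # Second-order cumulants (covariances): by independence,
    #   Cov(U)    = Cov(S1) + Cov(S2)
    #   Cov(U, V) = Cov(S2) A^T
    # so Cov(S1) can be recovered as Cov(U) - Cov(U, V) (A^T)^{-1}.
    cov_U = np.cov(U, rowvar=False)
    cov_UV = (U - U.mean(0)).T @ (V - V.mean(0)) / (n - 1)
    cov_S1_hat = cov_U - cov_UV @ np.linalg.inv(A.T)

    rel_err = (np.linalg.norm(cov_S1_hat - np.cov(S1, rowvar=False))
               / np.linalg.norm(np.cov(S1, rowvar=False)))
    print("relative error in recovered Cov(S1):", rel_err)

Already at second order, the cost is dominated by forming d x d cross-cumulants from n samples; higher-order cumulant tensors grow as d^t, which is why an explicit running-time discussion for the (t, d) combinations used in the experiments would help.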
=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper presents a new algorithm (and its analysis) for disentangling "rich" latent components that are linearly embedded in observations. The algorithm is based on cumulants and comes with both rigorous theoretical guarantees and strong empirical verification.

Clarity - Justification:
The key ideas of the paper are clearly presented, but I would love to better understand the empirical details of the algorithm. This might just be my problem (I do not have much background in cumulants). Do you need to write cumulants in terms of moments, and then estimate those moments from samples? Basically, it seems the method needs to convert from moments to cumulants (before the algorithm) and from cumulants back to moments (before the application). It would be great to have some explanation of how to do this conversion. (A sketch of the low-order scalar case is appended after this review.)

Significance - Justification:
The paper contributes a new, rigorously analyzed, and empirically effective method for multi-view learning, alongside other related methods (ICA, CCA). The proposed method is substantially more general than previous methods in that no assumption is made about the latent signals.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I did not go through the proofs in the appendix in detail. Below are some minor comments on the main text:
- There seems to be some discrepancy between the specific case (2 observations and 3 views) and the general case (k observations and p views). For the general case, Algorithm 1 (or 2) requires only the (L+1)-th order cumulants when the system is L-distinguishable, but for the specific case the authors need 4th-order cumulants (Eq. 2) even though the system is 2-distinguishable?
- Move the description of how A is sampled to the front of the experiments.
- Algorithm 3 is mentioned in the main text (Theorem 3.4) without a reference to the appendix.
- "Satisfysing" -> "Satisfying"
=====
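Addendum to Review #3 — the moments/cumulants conversion asked about under Clarity. For a scalar variable the low-order relations are standard textbook identities; the sketch below shows both directions up to order 4 and is general background rather than the paper's specific multivariate procedure.

    import numpy as np

    def raw_moments(x, order=4):
        """Empirical raw moments m_r = E[x^r] for r = 1..order."""
        return [np.mean(x ** r) for r in range(1, order + 1)]

    def moments_to_cumulants(m):
        """Scalar moments -> cumulants, valid up to order 4."""
        m1, m2, m3, m4 = m
        k1 = m1
        k2 = m2 - m1**2
        k3 = m3 - 3*m1*m2 + 2*m1**3
        k4 = m4 - 4*m1*m3 - 3*m2**2 + 12*m1**2*m2 - 6*m1**4
        return [k1, k2, k3, k4]

    def cumulants_to_moments(k):
        """Inverse relations for the same orders."""
        k1, k2, k3, k4 = k
        m1 = k1
        m2 = k2 + k1**2
        m3 = k3 + 3*k1*k2 + k1**3
        m4 = k4 + 4*k1*k3 + 3*k2**2 + 6*k1**2*k2 + k1**4
        return [m1, m2, m3, m4]

    x = np.random.default_rng(0).laplace(size=100000)
    m = raw_moments(x)
    print(moments_to_cumulants(m))                         # Laplace(0, 1): approx [0, 2, 0, 12]
    print(cumulants_to_moments(moments_to_cumulants(m)))   # round-trips back to m

In the multivariate case analogous relations hold for cumulant tensors, which is presumably the conversion the method relies on; a short appendix spelling this out would address the question above.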