We thank the reviewers for their valuable feedback. For the final version, we will improve the presentation as suggested. Below we address the reviewers' concerns.

R1&R4: [contributions] Although some of the techniques we use are not new, our paper makes a number of contributions (also summarized in lines 95-109 of our paper). More specifically: (a) The introduced models are novel and, importantly, identifiability results are provided. These identifiability results also extend to the discrete ICA model of Podosinnikova et al., where they were not formally stated. To the best of our knowledge, this is the first identifiability result for CCA-type models, whereas a significant number of the models proposed in the literature are not identifiable. (b) While the methods of Podosinnikova et al. apply almost directly (with similar theoretical guarantees) to the discrete CCA model, we propose a more general approach based on generalized covariance matrices. Although the underlying idea is not new, we believe the importance of this (somewhat forgotten) approach is significantly underestimated: a generalized covariance matrix contains higher-order information, so the 3rd- (and higher-) order cumulant tensors are replaced by matrices containing similar information (see the sketch at the end of this response). This greatly simplifies the derivations and decreases the computational complexity. (c) We propose novel computationally efficient and numerically stable algorithms for identifiable multi-view models. Although the algorithms combine known techniques, the connections are not obvious.

R4: [2. the whitening step?] Whitening allows us to reduce the original problem of joint diagonalization (JD) of the matrices D_1 B_p D_2^T, where B_p is diagonal, to the JD problem for matrices of the form Q B_p Q^{-1}, where Q is a square invertible matrix (we show this reduction in lines 677-697; a schematic reconstruction is also sketched at the end of this response). We use existing efficient optimization methods to solve this nonorthogonal JD problem. In principle, we could design an algorithm for the original JD problem of D_1 B_p D_2^T (which differs from the problem considered by Kuleshov et al. and others), but the existing methods worked well enough. To clarify (see also lines 594-598 & 697-707), Kuleshov et al. (and others) consider classical joint diagonalization by congruence (JDC), where the matrices have the form D B_p D^T. A number of methods solve this problem, including orthogonal JD or the tensor power method (after whitening) and nonorthogonal JDC (without whitening). Although the NOJD used in our paper shares its name with nonorthogonal JDC, it refers to the different problem described above (the two problems are equivalent only for orthogonal D or Q, which is not the case here).

R4: [1. comparison with ICA] Indeed, the three new models are particular cases of noisy ICA, x = D alpha + eps, which can be obtained using the stacking trick, with an additional Poisson assumption where needed. The main conceptual difference between our approach and ICA is that we do not require any assumptions on the view-specific noises eps_1 and eps_2 except for independence (see eq. 3). This is due to the form of the noise covariance matrix cov(eps) in eq. 14: we then work only with the blocks where cov(eps) is zero, as sketched below. This allows us to focus on the information common to both views (D_1 and D_2 in our case; the inference of the alphas is possible with known methods). The algorithms for ICA, by contrast, either assume that the noise is negligible or make additional assumptions on its nature, e.g., Gaussianity.
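To make the block structure concrete, here is a schematic of the stacked model and the covariance blocks it induces (a sketch in our notation; the exact form of eq. 14 in the paper may differ):

```latex
% Stacked form of the two-view model (notation ours):
%   x = [x_1; x_2] = D alpha + eps,  D = [D_1; D_2],  eps = [eps_1; eps_2],
% with alpha, eps_1, eps_2 mutually independent. Then
\[
\mathrm{cov}(\varepsilon) =
\begin{pmatrix}
  \mathrm{cov}(\varepsilon_1) & 0 \\
  0 & \mathrm{cov}(\varepsilon_2)
\end{pmatrix},
\qquad
\mathrm{cov}(x) =
\begin{pmatrix}
  D_1 \Sigma_\alpha D_1^{\top} + \mathrm{cov}(\varepsilon_1) & D_1 \Sigma_\alpha D_2^{\top} \\
  D_2 \Sigma_\alpha D_1^{\top} & D_2 \Sigma_\alpha D_2^{\top} + \mathrm{cov}(\varepsilon_2)
\end{pmatrix},
\]
% where \Sigma_\alpha = \mathrm{cov}(\alpha). The cross-view blocks are
% noise-free, so no distributional assumptions on eps_1, eps_2 are needed
% beyond their mutual independence.
```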
R3: [comparison to rich component analysis (RCA)] We thank the reviewer for pointing out the unpublished reference, which we will add in the revised version. Indeed, the RCA model is a more general multi-view model, where no independence assumption is made on the components of the view-specific noises or of the common signals. This generality comes at a cost: there are no identifiability guarantees for the RCA model, as opposed to our models (eqs. 4-6). Moreover, our models, unlike RCA, have a transparent link with CCA, as each view is modeled as a view-specific linear transform (our factor loading matrices D_1 and D_2) of common independent sources (our alpha). Further, the RCA algorithm is based on 4th-order cumulants, which are demanding in terms of storage and runtime (a 4th-order cumulant of a d-dimensional vector has O(d^4) entries; this concerns at least the estimation of A from their eq. 1) and have high sample complexity. Our algorithm, on the other hand, works with generalized covariance matrices and hence has significant advantages in terms of all the mentioned criteria, and it is straightforward to implement. Finally, our paper applies nonorthogonal JD to the cumulant-derived generalized covariance matrices, whereas RCA uses the cumulants for a further density approximation to solve maximum likelihood estimation, with SGD as the underlying optimization method.
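For concreteness, here is a minimal sketch of an empirical generalized covariance matrix, assuming the standard definition as the Hessian of the cumulant generating function at a processing point t (the function and variable names are ours, for illustration only): at t = 0 it is the usual covariance, while t != 0 mixes in higher-order cumulant information at the storage cost of a single matrix.

```python
import numpy as np

def generalized_covariance(X, t):
    """Empirical generalized covariance of the rows of X at processing point t.

    Assumes the definition C_x(t) = Hessian of K(t) = log E[exp(t^T x)],
    i.e. the covariance of x under the exponentially tilted measure;
    C_x(0) is the ordinary covariance.
    """
    z = X @ t
    w = np.exp(z - z.max())          # stabilized tilting weights ~ exp(t^T x_n)
    w /= w.sum()
    mean_t = w @ X                   # tilted mean = gradient of K at t
    Xc = X - mean_t                  # center under the tilted measure
    return (Xc * w[:, None]).T @ Xc  # tilted covariance = Hessian of K at t

# Sanity check: t = 0 recovers (a biased estimate of) the usual covariance.
rng = np.random.default_rng(0)
X = rng.poisson(3.0, size=(10000, 5)).astype(float)
C0 = generalized_covariance(X, np.zeros(5))       # close to np.cov(X.T, bias=True)
Ct = generalized_covariance(X, 0.1 * np.ones(5))  # carries higher-order information
```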
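Finally, to clarify the whitening step discussed above (R4, point 2), the reduction can be reconstructed schematically as follows; this is a sketch under the assumption that one invertible reference target is used for whitening, and the paper's actual construction (lines 677-697) may differ in details.

```latex
% Targets: T_p = D_1 B_p D_2^T with diagonal B_p (all factors invertible).
% Choose whitening matrices (W_1, W_2) for one reference target:
%   W_1 T_0 W_2^T = I.
% Writing Q = W_1 D_1, this gives (W_2 D_2)^T = B_0^{-1} Q^{-1}, hence
\[
  W_1 T_p W_2^{\top}
    = (W_1 D_1)\, B_p\, (W_2 D_2)^{\top}
    = Q \,\bigl(B_p B_0^{-1}\bigr)\, Q^{-1},
\]
% a joint diagonalization problem of the form Q \tilde{B}_p Q^{-1} with
% diagonal \tilde{B}_p = B_p B_0^{-1}, which the nonorthogonal JD solves.
```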