We thank the reviewers for their constructive comments.

* Applying PCA prior to NCCA seems to contradict the claim that KDE would manage high dimensions if the intrinsic dimensionality were low (R1): That's an excellent point, which we indeed did not clarify. We do not use PCA to improve the KDE accuracy. We use it only to reduce the computational load of finding nearest neighbors. Note that if we retain, e.g., 99% of the variance in the PCA, then the distances ||x_i-x_j|| in the KDE are barely affected, but computing these distances is much more efficient when the dimensionality is significantly lower. This is true whether we use brute-force search (as we did in the paper) or approximate nearest-neighbor algorithms such as LSH. We will clarify this (a minimal sketch of this preprocessing step appears at the end of this response).

* The comparison to KCCA is very thorough; perhaps add a similar comparison to DCCA (R1): In terms of experiments, we always compared to both KCCA (two variants) and DCCA. In terms of discussion, due to space limitations, we indeed chose to focus on the differences between NCCA and KCCA. That's because, superficially, they may seem similar (both are nonparametric methods that use an eigendecomposition of some kernel). DCCA, on the other hand, is a parametric method and is thus obviously very different from NCCA.

* NCCA involves density estimation in high dimensions, which seems to be a more difficult problem than nonlinear CCA itself (R2): Thanks for this comment. We actually believe that nonparametric density estimation is not much more difficult than the nonlinear CCA problem, which doesn't assume any functional restrictions. Due to the unit-norm constraints, the maximum-correlation objective can be equivalently written as a minimum-MSE one: min_{f,g} E||f(X)-g(Y)||^2 (a short derivation is given at the end of this response). Thus, nonlinear CCA is similar to a nonparametric regression task (over the dimensions of X and Y), where each view tries to predict the projection of the other view. Now, nonparametric regression is known to be just as difficult as nonparametric density estimation: for both problems, the minimax rate of convergence in dimension d is O(n^{-4/(4+d)}) (i.e., there doesn't exist an estimator with a faster convergence rate). Therefore, unless we restrict attention to smooth functions (as in KCCA), which we don't, KDE doesn't seem like overkill.

* Add empirical analysis around datasets where NCCA fails to outperform other baselines (R3): That's a good point. As we discuss in lines 857-863, NCCA often requires large training sets to perform well on high-dimensional problems. Indeed, with less training data we observe a drop in performance. In such settings, NCCA is typically outperformed by DCCA, but it still outperforms KCCA (see the numbers in line 863). We will add a more detailed analysis of this behavior.

* The approach is transductive. Has the quality of the Nystrom out-of-sample extension been studied carefully? (R3): That's true. NCCA, just like KCCA, is a transductive method and thus requires an out-of-sample extension. We verified the quality of our out-of-sample extension in two ways. First, we generated data from a multivariate Gaussian distribution and measured the deviation of the projections of test samples (obtained using our out-of-sample extension) from the known analytic solution (Hermite polynomials). The fit was very good, apart from slight inaccuracies in low-density areas, which barely affect the correlation and are typical of the Nystrom method.
Second, on the MNIST experiment, we compared computing the projections of the test data using the Nystrom extension with computing them by concatenating the test data to the training data and solving a larger NCCA problem. The difference in performance was tiny, but the Nystrom method was significantly faster. (A generic sketch of the Nystrom extension mechanism is included below.)
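For concreteness, below is a minimal sketch of the PCA preprocessing step described in our reply to R1's first comment. It is illustrative only: the library choice (scikit-learn) and the toy data are assumptions, not the paper's actual pipeline. The point is that PCA retaining 99% of the variance barely perturbs the pairwise distances used by the KDE, while making nearest-neighbor search much cheaper when the intrinsic dimensionality is low.

```python
# Illustrative sketch only: PCA is used to reduce the cost of nearest-neighbor
# search, not to improve KDE accuracy. Library choice (scikit-learn) and toy
# data are assumptions, not the paper's actual pipeline.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Toy data with low intrinsic dimensionality (20) embedded in 1000 dimensions.
Z = rng.standard_normal((5000, 20))
W = rng.standard_normal((20, 1000))
X = Z @ W + 0.01 * rng.standard_normal((5000, 1000))

# Keep enough principal components to retain 99% of the variance.
pca = PCA(n_components=0.99, svd_solver="full")
X_low = pca.fit_transform(X)

# Nearest neighbors (and hence the KDE weights) are computed in the reduced
# space; with 99% of the variance retained, ||x_i - x_j|| is barely affected,
# but each distance computation is far cheaper.
nn = NearestNeighbors(n_neighbors=50).fit(X_low)
dists, idx = nn.kneighbors(X_low)
```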
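The equivalence between the maximum-correlation and minimum-MSE formulations invoked in our reply to R2 follows from a one-line computation. Here is a short sketch of the standard argument (stated for a single pair of zero-mean, unit-variance projections; not a verbatim excerpt from the paper):

```latex
% Assuming zero-mean projections with E[f(X)^2] = E[g(Y)^2] = 1:
\begin{align*}
\mathbb{E}\big[(f(X)-g(Y))^2\big]
  &= \mathbb{E}[f(X)^2] + \mathbb{E}[g(Y)^2] - 2\,\mathbb{E}[f(X)\,g(Y)] \\
  &= 2 - 2\,\mathbb{E}[f(X)\,g(Y)].
\end{align*}
```

So maximizing the correlation E[f(X)g(Y)] is equivalent to minimizing the MSE E[(f(X)-g(Y))^2]; the same argument extends coordinate-wise to vector-valued projections under the unit-norm constraints.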
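Finally, since R3 asked about the Nystrom out-of-sample extension, here is a generic sketch of the mechanism. The specific kernel and normalization used by NCCA are not reproduced here; the Gaussian kernel, function names, and parameters below are illustrative assumptions. The point is only that, on a training sample, the extension reproduces the corresponding eigenvector entry exactly, so it is consistent with the transductive solution.

```python
# Generic Nystrom out-of-sample extension (illustrative sketch; the specific
# kernel and normalization used by NCCA are not reproduced here).
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def nystrom_extension(X_train, X_test, sigma, n_components):
    # Eigendecomposition of the kernel matrix on the training samples.
    K = gaussian_kernel(X_train, X_train, sigma)
    vals, vecs = np.linalg.eigh(K)             # eigenvalues in ascending order
    vals = vals[::-1][:n_components]           # top eigenvalues
    vecs = vecs[:, ::-1][:, :n_components]     # corresponding eigenvectors
    # Extend each eigenvector to a test point x via u(x) = k(x)^T U diag(1/lambda),
    # which reproduces U exactly when x is a training sample.
    K_test = gaussian_kernel(X_test, X_train, sigma)
    return K_test @ vecs / vals                # shape (n_test, n_components)
```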