Thanks for all the insightful comments. We will fix the typos and revise our writing accordingly. We first discuss two main points (mostly in response to the questions raised by reviewer 3) and then answer each reviewer individually.

First, we believe large-scale CCA is an important and commonly studied problem (as evidenced by the many cited ICML and NIPS papers in recent years). Also, as pointed out by reviewer 4, CCA and generalized eigenvector problems "are fundamental problems, and could be the basis for more complicated downstream methods (analogous to using PCA for PCR)".

Second, our algorithm is indeed "the first such algorithm with global linear convergence" for these problems. By "such" we mean "linear time", i.e., linear w.r.t. both the input size and k, up to a log(k) factor (we consider k << d). Near-linear runtime is essential for an algorithm to scale to large data sizes. Among existing methods for the generalized eigenvector problem: i) the power method on B^{-1}A with exact matrix inversion of B has runtime O(d^{2.373}) (not linear time); ii) the power method with a linear system solver for approximately applying B^{-1} requires the solver to be very accurate in the last several iterations of the power method, and a standard analysis gives a runtime scaling as O(log^2(1/epsilon)) rather than O(log(1/epsilon)) (not linearly convergent). Similarly for CCA, all previously popular methods are based on either matrix inversion (not linear time) or non-convex optimization (not globally convergent). One of our main technical contributions is quantifying the error tolerance and showing that a good warm start suffices (as pointed out by reviewers 1 & 4), which makes a linear-time, globally linearly convergent algorithm possible.

For reviewer 1: We will provide more intuition about the proof in the main paper and modify the experiment graph to better reflect the linear convergence predicted by the theory. CCALin starts out behind S-AppGrad because, while S-AppGrad makes progress in every iteration, CCALin only makes progress once per outer loop of the power method: it needs a certain number of FLOPs to approximately solve the linear system before it can give a good estimate of the top eigenvector.

For reviewer 3: Although the power method and linear system solvers are standard tools, simply combining them gives a runtime scaling as O(log^2(1/epsilon)), which is not linearly convergent. As pointed out above, our main technical contribution is quantifying the error tolerance and using a good warm start (see the sketch below for the overall scheme); these are the reasons for the lower complexity of our algorithm. To the best of our knowledge, this analysis is not standard and was not known before. Tricks like absorbing B^{-1/2} into the initialization vector do make initialization a bit easier, but they are not as important as the warm start. Minor points: For line 185, sorry for the confusion; it is a typo. In lines 169, 183, and 184, we actually mean left-multiplying by B^{-1/2} instead of B^{1/2}, which makes line 185 correct. FLOPs are computed as the total number of floating-point operations used by each algorithm, following the convention of (Ma et al., 2015). We report FLOPs rather than wall-clock time mainly to avoid unfair comparisons caused by highly optimized matrix inversion code.
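To make this concrete, below is a minimal numpy sketch (our own illustrative code, not the exact algorithm or notation from the paper) of an inexact power method for the generalized eigenvector problem Av = lambda Bv: each inner system Bx = Aw is solved only approximately by a blackbox first-order solver, warm-started from the best scalar multiple of the current iterate. All function names and parameter values here are assumptions for illustration.

```python
import numpy as np

def solve_linear_system(B, b, x0, n_inner, step):
    # Approximately solve B x = b by gradient descent on
    # f(x) = 0.5 * x^T B x - b^T x, starting from the warm start x0.
    # Any blackbox solver (Nesterov's AGD, CG, ...) could be used instead.
    x = x0.copy()
    for _ in range(n_inner):
        x -= step * (B @ x - b)            # grad f(x) = B x - b
    return x

def inexact_power_method(A, B, n_outer=50, n_inner=20):
    # Power method on B^{-1} A without ever inverting B.
    # A, B: symmetric, with B positive definite.
    d = A.shape[0]
    step = 1.0 / np.linalg.norm(B, 2)      # 1/L, L = largest eigenvalue of B
    w = np.random.randn(d)
    w /= np.sqrt(w @ B @ w)                # normalize in the B-norm
    for _ in range(n_outer):
        b = A @ w
        alpha = (w @ b) / (w @ B @ w)      # best scalar multiple of w as warm start
        x = solve_linear_system(B, b, alpha * w, n_inner, step)
        w = x / np.sqrt(abs(x @ B @ x))    # B-normalize the new iterate
    return w                               # estimate of the top generalized eigenvector
```

With exact inner solves this reduces to the classical power method on B^{-1}A; the point of the warm start is that a small, fixed n_inner suffices throughout, instead of an accuracy requirement that grows with the outer iteration count.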
For reviewer 4: We will explain more intuition about the warm start in the main paper. The rough intuition is as follows: although the power method requires the linear system solver to compute B^{-1}Aw_t more and more accurately as the iterations proceed, the iterates w_t and w_{t+1} also become closer to each other, so a scalar multiple of w_t naturally serves as a good initialization for computing B^{-1}Aw_t. A careful analysis shows that the ratio between the required accuracy and the initial error of the linear system remains constant even as the angle between w_t and v_1 goes to zero, which implies that most linear system solvers, such as Nesterov's AGD, require only a small number of iterations (independent of epsilon) even in the last several rounds of the power method.

Thanks also for pointing out the issue of choosing the parameter Q for Nesterov's method. Our main focus is on blackbox access to any linear system solver; Nesterov's method is only one example that achieves a good rate (see the sketch below for where Q enters). We can leverage the large body of work on estimating Q and use any linear system solver from that literature. Without a good estimate of Q, we can still use gradient descent as the linear system solver, which still gives a linear-time, globally linearly convergent algorithm (in return for a worse dependence on the condition number). We agree that estimating Q is important in practice; our choice of Q in the experiments is indeed based on the regularization parameter, and we will include more discussion in the paper as you suggested.

In the experiments, CCALin is actually better than S-AppGrad in both the medium and high TCC regimes, as CCALin requires fewer FLOPs to reach the same TCC. We believe CCALin has a faster convergence rate in practice (see Table 2 for the theoretical comparison).
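For concreteness, here is a minimal sketch of Nesterov's AGD on the inner system, showing exactly where Q enters (assuming B is positive definite; the names and the choice of L and mu below are illustrative, not the exact setup in our experiments):

```python
import numpy as np

def nesterov_agd_solve(B, b, x0, n_iter, L, mu):
    # Nesterov's AGD on f(x) = 0.5 * x^T B x - b^T x, for mu-strongly
    # convex f. Here L = largest eigenvalue of B, mu = smallest eigenvalue
    # of B (e.g. lower-bounded by the regularization parameter), and
    # Q = L / mu is the parameter whose estimation the reviewer asked about.
    Q = L / mu
    beta = (np.sqrt(Q) - 1.0) / (np.sqrt(Q) + 1.0)  # constant momentum
    x = x0.copy()
    y = x0.copy()
    for _ in range(n_iter):
        x_next = y - (B @ y - b) / L     # gradient step at the lookahead point
        y = x_next + beta * (x_next - x) # momentum step
        x = x_next
    return x
```

Setting beta = 0 recovers plain gradient descent, which needs only L and no estimate of mu (hence no Q), at the cost of replacing sqrt(Q) with Q in the inner iteration count; this matches the tradeoff described above.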