Thanks very much to all reviewers for their feedback and for raising many interesting issues. We will incorporate the suggested notation changes, add more background and details on the Schur decomposition, and reorganize Theorem 1 with more explanation and discussion.

Reviewer_1: We think it should be possible to derive a version of Theorem 1 using spectral norms; that would require some changes to the proof and possibly a different assumption on the noise term. Tensor power iteration in the non-orthogonal case traditionally involves pre-whitening, which is known to suffer from a number of limitations. More recent approaches that can also handle overcomplete models (e.g., arxiv:1402.5180/1408.0553) impose an incoherence or soft-orthogonality constraint on the tensor factors. Our method imposes no such soft-orthogonality constraint, but on the other hand it cannot be applied in the overcomplete setting. In (35), imposing a structure on the noise is indeed unnecessary; we reran the experiment with unstructured noise and obtained essentially identical results. k(V) in Theorem 1 should be \kappa(V), and it is indeed the condition number. In (16) it is indeed a Kronecker product; we will replace vector outer products with matrix operations throughout, to avoid confusion with Kronecker products.

Reviewer_3: "is guaranteed to converge to a local minimum that up to *first order* is equivalent to global minimum". We will revise this sentence and add some discussion around the proof of Theorem 1. Briefly, we characterize the critical points of the objective function (12) that are close enough to the zero-noise triangularizer U0. The characterization is based on a *first-order* approximation of the stationarity condition around U0, and hence it is valid for all critical points in a small neighborhood of U0 (and hence for all nearby local and global minima; see also the response to Reviewer_4). We finally note that Gauss-Newton is guaranteed to escape saddle points. Thanks for bringing the paper of Montanari & Richard (NIPS 2015) to our attention. A key difference between our approach and that of Montanari & Richard, as well as other power-iteration or simultaneous-diagonalization approaches, is that most of these methods recover the tensor components from the eigenvectors of the matricization, whereas we recover them from its eigenvalues (see the illustrative sketch below). The advantages of the latter are that eigenvalue perturbations are typically milder than their eigenvector counterparts; that we can use Schur triangularization, which is more robust than matrix diagonalization; that we can optimize over the manifold of orthogonal matrices even in the non-orthogonal case; and that we can use Gauss-Newton, which is much faster than state-of-the-art Jacobi algorithms. A more comprehensive numerical comparison with other methods would be desirable, but our focus has been largely on the theory, and we have deferred further experiments to subsequent versions of the paper.
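To make the eigenvalue-based recovery concrete, here is a minimal, hedged sketch restricted to the noise-free, orthogonally decomposable case (the paper's actual method optimizes (12) over orthogonal matrices and handles noisy, non-orthogonal factors); the dimensions and variable names below are purely illustrative:

```python
import numpy as np
from scipy.linalg import schur

rng = np.random.default_rng(0)
n = 6                                    # illustrative dimension (= rank here)
lam = rng.uniform(1.0, 2.0, n)           # positive weights lambda_r
V, _ = np.linalg.qr(rng.standard_normal((n, n)))   # orthonormal components v_r

# Symmetric tensor T = sum_r lam_r * v_r (x) v_r (x) v_r, with frontal slices
# T_k = sum_r lam_r * v_r[k] * v_r v_r^T.
T = np.einsum('r,ir,jr,kr->ijk', lam, V, V, V)

# A generic combination of slices shares the eigenvectors v_r, so its Schur
# factor Q jointly triangularizes (here: diagonalizes) every slice.
w = rng.standard_normal(n)
M = np.einsum('k,ijk->ij', w, T)
_, Q = schur(M)

# Stack the diagonals of Q^T T_k Q over k: row r equals lam_r * v_r (up to a
# permutation), i.e., the components are read off from eigenvalues, not eigenvectors.
D = np.stack([np.diag(Q.T @ T[:, :, k] @ Q) for k in range(n)], axis=1)
rec = D / np.linalg.norm(D, axis=1, keepdims=True)

print(np.round(np.abs(rec @ V), 3))      # ~permutation matrix: components recovered
```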
Reviewer_4: We will elaborate on local vs. global convergence. Theorem 1 only proves the existence of Z*; this Z* could be attained by solving the optimization problem (12) to (near) global optimality. We provide some theoretical arguments and experimental evidence that Gauss-Newton with an appropriate initialization can come close to Z* (a toy local-refinement sketch in this spirit appears below). Our arguments and proof techniques are somewhat similar in spirit to those of Kuleshov et al. (2015; arxiv:1501.06318), even though the two approaches differ. Experiments seem to indicate that our approach can be more robust than that of Kuleshov et al. (2015) in the low-noise regime. Bounds that involve the condition number appear in settings rather different from ours (e.g., probabilistic setting, whitening, random unfolding, sampled noise; arxiv:1204.6703/1311.3287), which makes a direct comparison with our bounds not straightforward; we will elaborate on this and cite the relevant results. We do not have an analogous perturbation bound for the case of simultaneous Schur triangularization, where the observable matrices are random combinations of the tensor slices. Thanks for bringing the paper of Vempala & Xiao (COLT 2015) to our attention, and for suggesting additional improvements.
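Purely as an illustration (not the paper's algorithm): a toy local refinement of an orthogonal joint triangularizer from a good initialization. The objective below (squared strictly-lower-triangular entries of U^T M_k U) is only a stand-in for (12), L-BFGS over a skew-symmetric parameterization stands in for the paper's Gauss-Newton iteration, and the synthetic matrices M_k stand in for (combinations of) tensor slices.

```python
import numpy as np
from scipy.linalg import expm, schur
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))
# Synthetic jointly triangularizable matrices (shared eigenvectors) plus small noise.
Ms = [A @ np.diag(rng.standard_normal(n)) @ np.linalg.inv(A)
      + 1e-3 * rng.standard_normal((n, n)) for _ in range(4)]

low = np.tril_indices(n, k=-1)           # strictly lower-triangular entries
upp = np.triu_indices(n, k=1)            # free parameters of a skew-symmetric S

def rotate(x, U0):
    # Map free parameters to U = U0 expm(S), S skew-symmetric, so U stays orthogonal.
    S = np.zeros((n, n))
    S[upp] = x
    return U0 @ expm(S - S.T)

def objective(x, U0):
    U = rotate(x, U0)
    return sum(np.sum((U.T @ M @ U)[low] ** 2) for M in Ms)

# Initialize with the Schur vectors of one slice combination, then refine locally.
_, U0 = schur(sum(Ms))
x0 = np.zeros(len(upp[0]))
res = minimize(objective, x0, args=(U0,), method='L-BFGS-B')
print(objective(x0, U0), '->', res.fun)  # off-triangular residual before/after refinement
```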