Paper ID: 1269
Title: Tensor Decomposition via Joint Matrix Schur Decomposition

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper gives a new algorithm for decomposing full-rank 3rd-order d x d x d tensors using approximate joint Schur decompositions. The main results are:
1. Introduction of a new iterative (deterministic) algorithm based on Gauss-Newton updates for decomposing 3rd-order tensors (CP decomposition) in the presence of errors, when the components of the decomposition are non-singular (full-rank case, rank R <= d). The advantage over popular methods like tensor power iteration is that this applies to non-orthogonal decompositions, and it seems more robust than simultaneous matrix decompositions.
2. A technical perturbation analysis showing that finding the best joint Schur decomposition can recover the factors.

Clarity - Justification:
The introduction and the context for the work are well written. However, the writing is too dense and technical, and as written it is understandable only to experts on this problem. While there is a theorem (Theorem 1) that gives the perturbation analysis for the result, no effort is made to interpret the result and compare it to the state of the art. Firstly, what does Theorem 1 say about recovering the factors of the decomposition? Secondly, how does it compare to the extensive previous work on tensor decompositions? Is the dependence on the condition number better? Is the dependence on the error term better than for simultaneous decompositions based on random combinations? I think the paper would be a lot more compelling with better writing.

Significance - Justification:
Tensor decompositions are increasingly important in learning various latent variable models (using the method of moments / spectral learning approach). There are two main criticisms of these methods: (1) non-robustness to noise in the data (bad dependence on the condition number of the factors, sampling error) compared to other optimization-based methods; (2) handling modeling errors (this is out of the scope of this work). For instance, the algorithm of Harshman / Leurgans et al. uses random combinations of tensor slices to obtain matrices for simultaneous diagonalization (using generalized eigenvector decompositions); a minimal sketch of this baseline is included below. However, the performance of these methods depends on the separation between the resulting eigenvalues, which could be fairly small (~ 1/k).
This work addresses (1) by introducing a different optimization-based approach to the simultaneous matrix decomposition problem: instead solving a related "joint Schur decomposition" problem. I like this different algorithmic approach -- it does seem inherently more robust, and considering that Gauss-Newton-based methods have better convergence guarantees than first-order methods, this seems like a good approach. I also like the experiments, which do indicate that this could be better than simultaneous diagonalization of random matrix slices (though [De Lathauwer et al.]'s algorithm matches their performance).
The idea of using a joint Schur decomposition approach is not completely new and is certainly inspired by De Lathauwer et al. However, the authors look at a modified version of De Lathauwer's objective (this difference could be brought out more clearly) and propose a Gauss-Newton-based iterative algorithm for it. This seems to work as well as De Lathauwer et al. empirically, and seems more amenable to analysis.
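For concreteness, the slice-based baseline referred to above (Harshman / Leurgans-style) can be sketched as follows. This is a minimal NumPy illustration under the exact, noise-free model T[i,j,k] = sum_r A[i,r] B[j,r] C[k,r] with non-singular square factors (R = d, as assumed in Section 2 of the paper); the function and variable names are mine, and this is not the algorithm proposed in the paper.

import numpy as np

def leurgans_style_factor(T, seed=0):
    # T: d x d x d array assumed to follow the exact trilinear model above.
    rng = np.random.default_rng(seed)
    d = T.shape[0]
    # Two random linear combinations of the frontal slices T[:, :, k].
    w1 = rng.standard_normal(d)
    w2 = rng.standard_normal(d)
    M1 = np.einsum('ijk,k->ij', T, w1)   # = A diag(C^T w1) B^T
    M2 = np.einsum('ijk,k->ij', T, w2)   # = A diag(C^T w2) B^T
    # M1 M2^{-1} = A diag(...) A^{-1}, so its eigenvectors recover the
    # columns of A up to scaling and permutation (equivalently, solve the
    # generalized eigenproblem for the pencil (M1, M2)).
    _, vecs = np.linalg.eig(M1 @ np.linalg.inv(M2))
    # In the exact model the eigen-pairs are real; the real part is taken
    # only to strip numerical round-off.
    return np.real(vecs)

The fragility mentioned above is visible in this sketch: when two of the resulting eigenvalues are close (separation ~ 1/k), the eigenvector estimates degrade quickly once noise is added.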
However, the technical analysis presented in the paper could be much clearer. There are some issues with global convergence (though the authors prove that the local optimum is close to the global optimum under some conditions on the parameters), and it would be good if these were addressed clearly. Also, while the experiments claim this is better than existing methods, it is very hard, as written, to compare the theoretical guarantees, viz. the perturbation bounds.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
1. In the introductory paragraphs, the authors should clearly mention that they are handling the full-rank case, where R <= d and the factors are non-singular. This distinguishes this work from the large body of recent work on overcomplete tensors. The first mention of this is in Section 2, where it says "for simplicity assume that R = d". On a related note, for M_k to be valid, \hat{m} should have an inverse -- this corresponds to the non-singularity of the matrix Z.
2. I like the fact that the authors do a good job of giving due credit to earlier works that gave perturbation guarantees for simultaneous matrix diagonalization methods. However, the paper does not mention other works that try to overcome the problem of separation between the eigenvalues, such as [Vempala & Xiao, COLT 15].
3. The statement of Theorem 1 is very hard to parse as stated. There are too many terms floating around, and considering this is the main result of the paper, I think the authors should spend a lot more time explaining this theorem. Is U in (16) supposed to be U_0, the optimizer of (12) with \sigma = 0?
4. There are notational issues and typos. First, U_m refers to the update after the m-th step in (21), while U_0 corresponds to the triangularizer of the exact problem (with no noise). Also, O(n) is not used consistently in Section 3; a scripted version, e.g. \mathcal{O}, would greatly help.
5. The authors should clarify the issue of local vs. global convergence for their algorithm. Section 3.1 (and Theorem 2) claims that U_m is close to U_0. It would be nice to explain why this is so (in particular, what is the norm of X in terms of the condition number of Z, etc.)?

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper provides a new method for tensor decomposition via joint matrix Schur decomposition. A set of matrices is extracted from the tensor, and an approximate joint Schur decomposition is applied to them to recover the rank-1 components of the tensor. A robustness-to-noise result is provided, but it is only an existence result. A new Gauss-Newton method is then proposed to recover a local minimum of the non-convex optimization problem.

Clarity - Justification:
The presentation is OK, but can be improved. After the introduction, the paper suddenly jumps into many technical details. This is in general acceptable since this is a theory paper, but there could be improvements. For instance, the main optimization problem in (12) could be explained in more detail. For someone who is not familiar with the Schur decomposition (recalled in the sketch below), it is hard to understand. The transition from (11) to (12) needs more explanation. There are also many details in Theorem 1 which are probably inevitable for completeness, but simpler versions and more explanations could be provided.
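As background for readers unfamiliar with it, the single-matrix (real) Schur decomposition referred to above is sketched below using SciPy; this is only an illustration of the standard factorization, while the paper's joint problem asks for one orthogonal Q that approximately upper-triangularizes several matrices M_k at once.

import numpy as np
from scipy.linalg import schur

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
# Real Schur decomposition: M = Q T Q^T with Q orthogonal and T
# quasi-upper-triangular (2x2 blocks for complex-conjugate eigenvalues).
T_sch, Q = schur(M, output='real')
assert np.allclose(Q @ T_sch @ Q.T, M)      # reconstruction
assert np.allclose(Q.T @ Q, np.eye(5))      # Q is orthogonal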
Since I am familiar with tensor decomposition techniques, it was easy to follow the tensor discussions, but it was still not easy to read the Schur decomposition material.

Significance - Justification:
I believe the paper has a sufficient amount of contribution. The way the matrices are formed is interesting, leading to the simultaneous diagonalization formulation for tensor decomposition. An approximate method for solving the non-convex optimization problem is also provided. For a theory paper, sufficient experiments are also provided.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
In the final bound in (14), and everywhere else in the paper, the norms are Frobenius norms, which can be much larger than the spectral norm. Isn't it possible to adapt the results to the spectral norm? Is there something fundamentally limiting in the algorithm and/or analysis that makes Frobenius norms show up in the bound, or are these bounds simply not tight?
Comparing with tensor power iteration, you only mention the orthogonal case. There are also analyses of tensor power iteration in the non-orthogonal and overcomplete settings; for instance, see "Learning Overcomplete Latent Variable Models through Tensor Methods".
I have the following detailed comments too:
Experiments: In (35), why are you imposing a rank-p structure on the noise too? Usually noise is not structured like this!
Presentation: As I mentioned above, I suggest providing more background and details on the notion of Schur decomposition.
There are some typos:
Line 190: extra "we"
Line 201: 1 -> I
Line 232: M_k is NOT defined in (5).
Line 459: Theorem (1) -> Theorem 1
Notation: The Kronecker product is different from what you are using in the tensor decomposition; that product is usually called the outer or tensor product. Note that the Kronecker product of two d-dimensional vectors is a d^2-dimensional vector.
What is k(V)^3 in the result of Theorem 1? Do you mean the condition number?
Equation (16): I do not understand the outer product here. The outer product between two matrices is a tensor with four modes. Is it a Kronecker product? If so, you are using \otimes for both Kronecker and outer products, which is confusing.
Better to use ||A||_F for the Frobenius norm; ||A|| is usually used for the spectral norm.
In the synthetic experiments, why do you use a new notation p for the rank? There is no need to introduce extra notation; you use R earlier.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes to tackle the tensor factorization problem as a matrix factorization problem, where the matrices to factorize are obtained by slicing the input tensor. The authors use a simultaneous diagonalization approach that they solve using an iterative Gauss-Newton method.

Clarity - Justification:
The paper is globally well written. It is unclear to me what the authors mean in the abstract by "[...] is guaranteed to converge to a local minimum that *up to first order* is equivalent to a global minimum". What does first order have to do with a minimum? It is the second-order information that determines whether we are dealing with a maximum, a minimum, or (very often the case) a saddle point.

Significance - Justification:
There should have been an "Average" option for significance. The paper uses a methodology that is very standard for tensor factorization: flattening / unfolding / matricization (illustrated in the sketch below). Here the unfolding is performed in a special manner, but this does not change the quality of the results.
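For reference, the standard flattening / unfolding / matricization mentioned above can be written in a few lines of NumPy. This shows only the generic operation, not the paper's particular slicing, and the helper name is mine.

import numpy as np

def unfold(T, mode):
    # Mode-n unfolding of a 3rd-order tensor: move the chosen mode to the
    # front and flatten the remaining two modes into the columns.
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

T = np.arange(24).reshape(2, 3, 4)
print(unfold(T, 0).shape)   # (2, 12)
print(unfold(T, 1).shape)   # (3, 8)
print(unfold(T, 2).shape)   # (4, 6)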
The idea of simultaneous diagonalization has already been explored by Kuleshov and is cited here. It has been proved by Montanari & Richard (NIPS 2015) that matricization alone performs worse than matricization followed by power iteration. The authors do not cite that paper and do not compare their results with that approach.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Overall, the only contribution of the paper is to use a better optimization algorithm (Gauss-Newton) for optimizing the objective. The authors fail to cite and compare against a paper that has better results than their approach (even though only in the rank-1 case). This makes the statistical contribution questionable. The empirical results are interesting. The plots should have been bigger, and the "dogs" and "birds" datasets could have been explained a bit better.

=====