Paper ID: 182
Title: Learning from Multiway Data: Simple and Efficient Tensor Regression

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors propose an optimization algorithm for regression problems parametrized by a low-rank tensor. The low-rankness of a tensor is defined in the sense of the Tucker decomposition. The proposed algorithm is a projected gradient method. The projection onto the set of low-rank tensors is carried out approximately with an iterative algorithm (Algorithm 2). The authors provide an estimation error bound based on a tensor generalization of the restricted isometry condition. Sketching is also used to deal with large sample sizes. Empirical comparisons on synthetic and real datasets seem to show the advantage of the proposed method.

Clarity - Justification:
There are unclear assumptions in the proof of Theorem 5.3. Please see the detailed comments.

Significance - Justification:
The proposed algorithm requires a projection onto the set of low-rank tensors in every iteration. It is unclear how this can be carried out, even approximately.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
There are several weaknesses in the main theorem (Theorem 5.3). The most obvious one is that the authors do not show that the RIP assumption with respect to the tensor rank holds. There is a line (#404) saying that this is true with high probability; I assume this refers to previous work in the context of sparsity or low-rank matrices. The authors need to be more specific (e.g., cite a paper) if I am wrong.

Second, Lemma 5.2 at most guarantees that Algorithm 2 finds a stationary point. It is not clear why it can avoid a saddle point, or avoid finding a degenerate solution like U_1 = [u_1, …, u_1], U_2 = [u_2, …, u_2], U_3 = [u_3, …, u_3], where (u_1, u_2, u_3) is the tuple of the first singular vectors of W~. This weakness shows up in inequality (5) in the proof of Theorem 5.3: the authors argue that u(W^{t+1}) <= u(W*), but this is not true unless Algorithm 2 finds a *global* minimizer of the projection problem.

Finally, there is an unclear assumption C^2 ||E||_F^2 <= L(W^k) (around line #509). Since L(W*) = ||E||_F^2, and a C smaller than one would make the assumption below (10) false, this is saying that W^k cannot overfit at all. Is this realistic? How large does C need to be for the theorem to hold with delta_{2R} < 1/3? Please comment.

Minor comment: regarding the restaurant dataset in Figure 2(a), the result given in Wimalawarne et al. (2014) seems to be on par (at least around 2000) with the proposed method.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors propose an optimization method for (Tucker) low-rank tensor regression based on projected gradient descent. This relies on an efficient projection algorithm, based on power iterations, that the authors develop. The authors guarantee the convergence of the method using the RIP assumption. They also use count sketch to make the algorithm faster.

Clarity - Justification:
The paper is clear overall, although some parts could be further elaborated. In particular, the count sketch explanation in the paragraph before Sec. 5 seems confusing or incomplete. In lines 81-90, the phrases "tensor regression form" and "tensor regression model" are used several times, but it is not completely clear to me what is meant by them. Do they refer to the loss function or to the regularizers used?
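For reference, the count sketch step flagged above typically amounts to hashing each sample into one of a small number of buckets with a random sign and summing. The snippet below is a minimal sketch of that generic construction applied to a vectorized regression problem; it is an illustration only, not the paper's actual construction, and all names (count_sketch, sketch_size, etc.) are assumptions.

```python
import numpy as np

def count_sketch(X, y, sketch_size, seed=0):
    """Compress n samples (rows of X, entries of y) into sketch_size rows.

    Standard count sketch: sample i is hashed to bucket h(i) and added with
    a random sign s(i). The sketch is built in a single pass over the data.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    buckets = rng.integers(0, sketch_size, size=n)   # hash h(i)
    signs = rng.choice([-1.0, 1.0], size=n)          # random sign s(i)

    X_sk = np.zeros((sketch_size, X.shape[1]))
    y_sk = np.zeros(sketch_size)
    np.add.at(X_sk, buckets, signs[:, None] * X)     # X_sk = S @ X
    np.add.at(y_sk, buckets, signs * y)              # y_sk = S @ y
    return X_sk, y_sk

# The downstream solver is then run on (X_sk, y_sk) instead of (X, y),
# with sketch_size much smaller than the original sample size n.
```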
Significance - Justification:
The paper provides useful results, both in demonstrating convergence of the optimization algorithm and in showing empirically that the resulting method, while being very fast, outperforms the other competitors on several tensor learning tasks.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The authors should comment on the relationship of their developments to the paper: Rauhut, Holger, Reinhold Schneider, and Z. Stojanac. "Low rank tensor recovery via iterative hard thresholding." Proc. 10th International Conference on Sampling Theory and Applications, 2013.

In the definition of RIP (5.1), I think that in "The isometry constant of W", W should be replaced by X. In eqs. 6 and 7, the superscript t should be replaced by k.

There are typos across the text:
- 81: "data tensor at every modes": remove the last "s"
- 379: "an zero": remove the "n"
- 448: "a optimal": "an optimal"
- 502: there is a missing bar
- 794: "TPG outperform": add an "s"
- 801: "Figure 3 show": add an "s"
- 805: "provides interesting": "provides an interesting"
- 856: "like first order method": add an "s"

Summary: The proofs look correct, the paper reads well, and the experimental section provides results that support the authors' claims, both in terms of accuracy and efficiency. Provided that the authors address all these comments, I think the paper should be accepted.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
Tensor regression by a subsampled tensor projected gradient method, with memory requirements linear in the problem size, and avoiding full SVD computations.

Clarity - Justification:
The methods are presented in a clear way.

Significance - Justification:
Developing large-scale methods for tensor regression is challenging. The significance is difficult to assess because other efficient techniques that avoid computing full SVDs have also been proposed in the literature.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The scope of the paper is important: developing methods for large-scale tensor regression. As the authors explain, the computation of full SVDs should be avoided at this scale. On the other hand, other methods that avoid computing full SVDs have also been proposed, e.g., rank-one tensor updating algorithms for tensor completion. It would be good if the authors checked the more specialized literature on large-scale tensor methods and compared with it, qualitatively and/or numerically.

=====
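All three reviews describe the same core scheme: projected gradient descent in which each gradient iterate is mapped back onto the set of low-Tucker-rank tensors by an approximate projection. The sketch below illustrates that loop for a third-order coefficient tensor, using a plain HOSVD-style truncation as the projection rather than the paper's power-iteration Algorithm 2; all function names, shapes, step sizes, and iteration counts are illustrative assumptions, not details taken from the submission.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n matricization: the given mode becomes the rows, the rest the columns."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def tucker_project(W, ranks):
    """Approximate projection of W onto tensors of multilinear rank <= ranks.

    Uses truncated SVDs of the mode unfoldings (HOSVD-style truncation), which is
    only quasi-optimal; the paper's Algorithm 2 instead uses power iterations.
    """
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(W, mode), full_matrices=False)
        factors.append(U[:, :r])
    # Core: contract every mode of W with the corresponding factor transposed.
    core = W.copy()
    for mode, U in enumerate(factors):
        core = np.moveaxis(np.tensordot(U.T, core, axes=(1, mode)), 0, mode)
    # Reconstruct the projected tensor from the core and the factors.
    W_proj = core
    for mode, U in enumerate(factors):
        W_proj = np.moveaxis(np.tensordot(U, W_proj, axes=(1, mode)), 0, mode)
    return W_proj

def tensor_projected_gradient(X, y, shape, ranks, step=1e-3, iters=200):
    """Minimize 0.5 * sum_i (<X_i, W> - y_i)^2 over low-Tucker-rank W.

    X has shape (n, d1, d2, d3) (one covariate tensor per sample),
    y has shape (n,), and `shape` is (d1, d2, d3).
    """
    W = np.zeros(shape)
    for _ in range(iters):
        resid = np.tensordot(X, W, axes=3) - y       # <X_i, W> - y_i
        grad = np.tensordot(resid, X, axes=(0, 0))   # sum_i resid_i * X_i
        W = tucker_project(W - step * grad, ranks)
    return W
```

As Review #1 points out, the quality of this projection step is exactly what the argument around inequality (5) hinges on: an approximate projection that is not a global minimizer does not justify u(W^{t+1}) <= u(W*).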