We thank all the reviewers for their helpful comments.

====Assigned_Reviewer_1

> the model assumed and the results obtained are more or less straightforward and is along expected lines.

Our main novelty is not only the AMNR model itself but also its theoretical analysis. Although several authors have proposed similar models (e.g., Signoretto et al.), none of them has studied their statistical properties, such as the rate of convergence of the estimator and how to control the model complexity.

====Assigned_Reviewer_2

1) Will the models easily suffer from over-fitting problems?

No. We claim our method is robust against over-fitting because of
- Bayesian estimation with the GP prior, and
- regularization by limiting M and R.
By keeping M and R small, the model complexity is suppressed and over-fitting is prevented. The experimental results (Figures 2 and 3) support our claim.

2) How can we decide the non-identification issue for low-rank or full-rank data?

We may not have understood your question correctly. Does "decide" mean "detect"? If so, our answer is: we do not need to check whether X is identifiable. By using random sign flipping, we can consistently avoid the non-identification issue. See the Remark in Section 3.

3) Intuition for doubly decomposing

"Doubly decomposing" means decomposition of both (a) the input tensors and (b) the regression function. To improve prediction accuracy, we want to reduce unnecessary complexity of the model, which worsens the efficiency of estimation (i.e., more samples are needed to achieve the same accuracy). For that purpose, we use the CP decomposition to reduce the complexity of the input tensors, which is (a). This decomposition naturally induces the decomposition of the model function (Eq. 6), which is (b). (An illustrative sketch of this structure is given at the end of this response.)

4) How is the computational efficiency and scalability?

See Section 7. To sum up:
- the time complexity is cubic in nR, where n is the sample size and R is the rank of the CP decomposition;
- this is not very scalable, but it can be improved by, e.g., the Nystrom method (a generic sketch is also given at the end of this response).

5) Why will it perform better than the literature methods?

Because the "double decomposition" improves estimation performance (see our answer to 3)).

====Assigned_Reviewer_4

1) Can authors comment more on non-unique CP decomposition and the sign flipping? How is it performed and how does it exactly limit the sample size?

The sign flipping is used mostly for the theoretical guarantee; in the actual implementation we do not use it (and our algorithm still works well). The sign flipping reduces the effective sample size by a factor that depends on K, the order of X. For example, at K = 2, there are 2 sign combinations that represent the plus sign (++ and --) and 2 combinations that represent the minus sign (+- and -+). This reduces the effective sample size to half. However, the reduction factor does not depend on n, so the order of convergence does not change, i.e., O(1/\sqrt{0.5 n}) = O(1/\sqrt{0.5} * 1/\sqrt{n}) = O(1/\sqrt{n}). (A small numerical illustration of the sign ambiguity is given at the end of this response.)

2) the family of functions this decomposition can learn

Our method can learn smooth functions on a tensor space, such as functions belonging to the Sobolev, Hölder, and C classes. This is almost the same as the general results of nonparametric estimation theory (Tsybakov 2008).

3) experiments with e.g. 5th order tensors or 1000x1000x1000 tensors

For 1000^3 tensors, we have obtained a preliminary result that is similar to the other results shown in the paper. We will add it in a future version.
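Sketch for 3) of Reviewer 2. The following is a minimal illustration of the "double decomposition" idea, not the paper's exact Eq. 6: the input tensor is built from CP factor vectors, and the regression function decomposes into a sum over CP components of products of one-dimensional local functions. The toy nonlinearity local_f is an assumption standing in for the GP-modeled local functions; all names here are illustrative.

# Illustrative sketch of "double decomposition" (NOT the paper's exact Eq. 6):
# (a) the input tensor is CP-decomposed into factor vectors, and (b) the
# regression function decomposes into a sum over the CP components of
# products of one-dimensional local functions applied to each factor.
import numpy as np

def cp_rank1(vectors):
    """Outer product of K vectors -> a rank-1 tensor of order K."""
    t = vectors[0]
    for v in vectors[1:]:
        t = np.tensordot(t, v, axes=0)
    return t

# (a) build a rank-R input tensor from known CP factors (K = 3, R = 2 here)
rng = np.random.default_rng(0)
K, R, dim = 3, 2, 4
factors = [[rng.standard_normal(dim) for _ in range(K)] for _ in range(R)]
X = sum(cp_rank1(fs) for fs in factors)

# (b) a doubly decomposed regression function: a sum over the R components
# of products of K one-dimensional local functions. Here each local
# function is a fixed toy nonlinearity; in AMNR they are GP-distributed.
def local_f(v):
    return np.tanh(v).sum()  # toy stand-in for a GP-modeled local function

def f_doubly_decomposed(cp_factors):
    return sum(np.prod([local_f(v) for v in comp]) for comp in cp_factors)

print(f_doubly_decomposed(factors))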
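Sketch for 4) of Reviewer 2. The cubic cost comes from factorizing a Gram matrix whose size grows with nR; the Nystrom method is a standard way to approximate such a matrix from m << N landmark points. The sketch below is a generic illustration of that standard trick, not our implementation; the RBF kernel and all names are assumptions for the example.

# Generic Nystrom approximation of an N x N kernel matrix from m << N
# landmarks: K ~= C @ pinv(W) @ C.T with C = K[:, idx], W = K[idx, idx].
# Downstream linear algebra then works with rank-m factors instead of the
# full N x N matrix, avoiding the cubic cost in N.
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
N, m = 500, 50
Z = rng.standard_normal((N, 3))          # stand-in for the nR GP inputs
idx = rng.choice(N, size=m, replace=False)

C = rbf_kernel(Z, Z[idx])                # N x m
W = rbf_kernel(Z[idx], Z[idx])           # m x m
K_approx = C @ np.linalg.pinv(W) @ C.T   # rank-m approximation of K

K_exact = rbf_kernel(Z, Z)
print(np.linalg.norm(K_exact - K_approx) / np.linalg.norm(K_exact))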
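Illustration for 1) of Reviewer 4. The snippet below numerically shows the sign ambiguity of the CP decomposition that motivates the random sign flipping: at K = 2, the patterns ++ and -- yield the same rank-1 tensor, as do +- and -+, so the four sign patterns collapse into two classes. The variable names are illustrative.

# Sign ambiguity of a rank-1 CP component at K = 2: flipping the signs of
# both factor vectors leaves the tensor unchanged, so ++ and -- coincide,
# and likewise +- and -+.
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.standard_normal(4), rng.standard_normal(5)

T_pp = np.outer(u, v)        # (+, +)
T_mm = np.outer(-u, -v)      # (-, -): identical to (+, +)
T_pm = np.outer(u, -v)       # (+, -)
T_mp = np.outer(-u, v)       # (-, +): identical to (+, -)

print(np.allclose(T_pp, T_mm))   # True
print(np.allclose(T_pm, T_mp))   # True
print(np.allclose(T_pp, -T_pm))  # True: the two classes differ only in sign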