Paper ID: 766
Title: Gaussian process nonparametric tensor estimator and its minimax optimality

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes a nonparametric model for the tensor regression problem and provides an analysis of its rates of convergence.

Clarity - Justification:
Well-written and easy to understand.

Significance - Justification:
Very relevant to modern tensor data problems.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper proposes a nonparametric model for the tensor regression problem. The regression function is assumed to be a sum-product of one-dimensional functions, which is made possible by assuming a low-rank structure on the input tensor. This structure yields improved rates of convergence (in terms of posterior convergence) compared to the naive approach.
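In symbols (my notation, since the review does not reproduce the paper's, and the exact form used in the paper may differ), the assumed structure reads something like

$f(x^{(1)}, \ldots, x^{(K)}) = \sum_{r=1}^{R} \prod_{k=1}^{K} f^{(k)}_r(x^{(k)})$,

i.e., a rank-$R$, CP-type sum of products of one-dimensional component functions, with Gaussian process priors placed on the components $f^{(k)}_r$. Heuristically, each component only needs to be estimated as a function of a single variable, so one would expect a rate close to the one-dimensional nonparametric rate, roughly $n^{-s/(2s+1)}$ for $s$-smooth components (up to logarithmic and rank-dependent factors), in place of the naive $K$-dimensional rate $n^{-s/(2s+K)}$; the paper's exact rates may of course differ.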
The theoretical results are supplemented with simulations and empirical results on a real-world data set. The results of the paper are complete and the presentation is clear. The posterior convergence rates are proved under standard assumptions made in the literature, and they are supplemented by matching minimax lower bounds. On the whole the paper is a complete study, but novelty is the main drawback: one would expect the results obtained in this paper, for the model considered, to follow from standard proof techniques.

Typo: line 436 nive => naive

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper extends linear tensor learning to a nonlinear setting via standard RKHS arguments coupled with a Bayesian learning procedure, placing Gaussian process priors on the regression coefficients of the basis functions (corresponding to a specific RKHS).

Clarity - Justification:
The paper is very difficult to read and took me a lot of time to digest. The poor readability is partly because the notation is too heavy, partly because too many symbols are not introduced clearly, and partly because of the poor organization of the results. For example, I could not find the definitions of the prior distribution $GP(;\lambda)$ and the norm $\|\cdot\|_n$, which are key to understanding the procedure and the theory.

Significance - Justification:
The results look solid but, as I said above, need to be presented more clearly.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I believe the most serious issue with this paper is the presentation. The authors simply assume that the readers are experts familiar with this area, and even with the notation used in related nonparametric regression problems. This is unacceptable. In addition, I have several comments:

(1) I would like to see more discussion in Section 3.3. For example, how computationally efficient is the Gibbs sampler in this specific case? How fast does it converge? What happens if the dimension K is very large?

(2) This line of thinking is similar to deep learning in the sense that the model is first over-parametrized and then distilled. I was therefore wondering whether the authors could offer further thoughts comparing these two research directions.

(3) I do not wish to overemphasize it, but I do wish the authors to think about the bias-variance tradeoff. The rate of convergence obtained in this paper is clearly much slower than the parametric rate when a parametric assumption (e.g., a linear model) holds. Hence, a bias-variance tradeoff exists: when the parametric modeling bias is large, the proposed procedure wins; otherwise, it is unclear. It therefore appears to me that a model validity check is important before any very complex model is applied to real data, and we need to be very careful in employing such complex models.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The article considers nonparametric function estimation by expanding the unknown function in a tensor product basis and equipping the tensor arms with Gaussian process priors. Minimax optimality of the resulting procedure is established. Several applications to multi-task learning are provided.

Clarity - Justification:
The paper is nicely written overall. Typo at the bottom of page 4: nive --> naive

Significance - Justification:
The minimax optimality result is novel. Estimating a function of many variables is difficult due to the curse of dimensionality. Using a low-rank tensor decomposition is a natural dimension-reduction technique.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Some recent papers have established improvements in rates when the true function has more structure, e.g., depends only on a few coordinates, has anisotropic smoothness, or, more generally, when the covariates lie on an unknown manifold. See, for example, Bhattacharya et al. (2014) [AoS 2014, Vol. 42, No. 1, 352–381] and Bayesian manifold regression by Yang and Dunson (2015+) for some recent work on GP regression. Some clarification positioning the contribution of the article relative to the existing literature would enhance the scope of the presentation.

=====