We are grateful to the reviewers for taking the time to review our paper and for their constructive comments. Please find our answers below.

* Assigned_Reviewer_2:

- please double check the fact that the \langle \cdot, \cdot \rangle is an inner product. [...] Anyway, [...] this property is never needed in the rest of the article.

Indeed, since this property is not used, we will simply remove the claim.

- regarding choice of the rank

In the paper, we set r = k / m, where r is the rank before symmetrization and k is the rank after symmetrization. For the case m = 2, where m is the degree (i.e., the case of matrices), a basic rank inequality gives r <= k <= 2r. Setting r = k / 2 is thus the most pessimistic choice for our lifted approach; in comparison, setting r = k would have given an unfair advantage to it. The choice r = k / 2 is also confirmed in practice: because UV^T is typically not symmetric at local minimizers of Eq. (11), the rank of S(UV^T) doubles in practice (see the numerical sketch at the end of this response). For the case m > 2 (the case of tensors), we are not aware of inequalities relating r and k, but we believe the choice r = k / m is fair in terms of model size (number of floats).

- l710-786: I do not understand the generality of the remark. I understand it for features with 0/1 values

Indeed, our discussion is most relevant for binary features, because monomials can then be interpreted as co-occurrences of features. We will clarify this.

- Thanks a lot for all other comments; we will address them.

* Assigned_Reviewer_4:

- The authors make a claim/suggestion somewhere in the submission that the proposed approach can scale gracefully to orders. [...] However, this is never evaluated.

The cost per epoch of our training algorithm for polynomial networks scales linearly with the degree, so we believe it should scale well. We had to make choices due to space limitations, but we plan to demonstrate (much) higher-order feature combinations empirically in an extended version of the paper.

- The discussion on benefits of Anova versus polynomial case feels somewhat incomplete. What are the cases in which Anova would be clearly a better performer than polynomial-approaches and why?

The polynomial kernel is better suited for function approximation than the ANOVA kernel, since it uses all monomials, including power features. The ANOVA kernel, however, can enjoy better generalization guarantees when the power features are clearly not useful, as is the case for binary features (a small illustration follows this section). This is a difficult question in general, and we will need more experiments on diverse tasks to answer it.

- Can authors provide any clear cut support for the alternation-based optimisation versus non-convex optimisation? [...] If the reviewer is not mistaken, it is known that for the locally smooth functions, both problems should converge to their local minimum

You are absolutely right. Although we do not yet have theoretical guarantees favoring multi-convex problems over general non-convex ones, we believe multi-convex problems are more convenient in practice for convergence monitoring and for automatically choosing the step size (by exploiting Lipschitz continuity). Moreover, multi-convex objectives are a much more natural fit for coordinate descent (CD); we have never seen CD applied to non-convex problems that are not multi-convex.

- Also, is it possible to add the direct approach in Figure 2b and Figures 3a and b?

We do not plan to add the direct CD approach to Figure 2b because polynomial networks are not multi-convex. We will add a direct vs. lifted comparison on the recsys data to the appendix.
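To make the ANOVA vs. polynomial distinction above concrete, here is a minimal sketch (our own illustration, not taken from the paper; the function names and the homogeneous form of the polynomial kernel are our assumptions) of the degree-2 polynomial kernel versus the degree-2 ANOVA kernel. On binary features, the two differ exactly by the squared (power) terms, which are redundant since x_j^2 = x_j:

    import numpy as np
    from itertools import combinations

    def poly_kernel(x, z, m=2):
        # Homogeneous polynomial kernel: uses ALL degree-m monomials,
        # including powers such as x_j^2 z_j^2.
        return float(np.dot(x, z)) ** m

    def anova_kernel(x, z, m=2):
        # ANOVA kernel: only monomials over m *distinct* indices j1 < ... < jm.
        idx = range(len(x))
        return sum(np.prod([x[j] * z[j] for j in c])
                   for c in combinations(idx, m))

    x = np.array([1.0, 0.0, 1.0, 1.0])  # binary features
    z = np.array([1.0, 1.0, 0.0, 1.0])
    print(poly_kernel(x, z))   # 4.0 = (x.z)^2, counts power terms
    print(anova_kernel(x, z))  # 1.0, only the cross-term (j=0, j'=3)

For m = 2, this agrees with the identity A^2(x, z) = ((x.z)^2 - sum_j (x_j z_j)^2) / 2, here (4 - 2) / 2 = 1.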
* Assigned_Reviewer_5:

- Can every symmetric matrix of rank k be represented as a sum of a rank-k/2 matrix and its transpose?

For the case of matrices, yes, because r <= k <= 2r. For the case of tensors, our choice r = k / m is ad hoc, but it is fair in terms of model size (number of floats).

Thanks again to all reviewers for their constructive comments.
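As an addendum to the rank discussions above (Reviewer_2 and Reviewer_5), here is a minimal numerical sketch of the rank-doubling behavior, assuming S denotes the symmetrization S(M) = M + M^T (our notational assumption; the conclusion is unchanged if S(M) = (M + M^T) / 2, which has the same rank):

    import numpy as np

    rng = np.random.default_rng(0)
    d, r = 10, 3
    U = rng.standard_normal((d, r))
    V = rng.standard_normal((d, r))

    M = U @ V.T                       # generically of rank r
    S = M + M.T                       # assumed symmetrization S(M)
    print(np.linalg.matrix_rank(M))   # 3 (= r)
    print(np.linalg.matrix_rank(S))   # 6 (= 2r): the rank doubles

Since M is generically not symmetric, the column space of S(M) spans the columns of both U and V, giving k = 2r whenever 2r <= d; this matches the most pessimistic case covered by the choice r = k / 2.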