Paper ID: 400
Title: Polynomial Networks and Factorization Machines: New Insights and Efficient Training Algorithms

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes a unifying framework for factorization machines and polynomial networks. The proposed models are trained with a coordinate descent algorithm (on multi-convex objective functions). This enables the authors to handle high-order interactions in regression problems, for instance up to third order.

Clarity - Justification:
The paper is particularly clear and well written. A few typos and small issues are raised below, but they could easily be corrected. Still, to improve the reading experience, a few more details could be added in the discussion part.

Significance - Justification:
The unifying framework is a very neat contribution from the authors. I find it particularly important for several communities (neural networks, high-order regression, kernel methods, matrix factorization). The learning-rate-free aspect of the proposed algorithm is also a very appealing property.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
l098: the weight -> the weights?
l129: a a -> a
l171: harmonize the dots (\cdots, \ldots, ...)
l201: we are first -> we are the first
l221: pj -> pj,
l283: "in this section," -> remove this.
l300: please double check the fact that \langle ., . \rangle is an inner product: for m = 2 and d = 2 one has \langle x, x \rangle = x_1^2 x_2^2 = 0 iff x_1 = 0 or x_2 = 0, whereas an inner product should vanish only for x_1 = x_2 = 0. Anyway, it does not really matter since this property is never needed in the rest of the article.
l385: add a "," before \forall.
l397: c.f. -> cf. (idem l491, l655)
l468: SGD -> stochastic gradient descent?
l595: please remind the reader what you call the "objective function" here.
l618: I disagree: the rank of the 3 x 3 matrix with 1's on the upper diagonal is 2, while its symmetric version also has rank 2. Please correct this part (see the short check below).
l627: a link to the dataset used could be provided.
l642: is the loss function simply the squared loss here? Please clarify, and maybe also recall the ridge formulation considered.
l702: the choice r = k/3 seems ad hoc; is there any further reason why this is "fair"? Same remark later (l857).
l710-786: I do not understand the generality of the remark. I understand it for features with 0/1 values: possibly one has never observed cases with both x_i = 1 and x_j = 1, so inference on a case with x_i = x_j = 1 will not happen in the kernel context. But if the variables are continuous, I am missing the benefit. Can you please improve this section, since it seems to be a nice property of your method?
l844: NLP -> natural language processing?
Appendix, l1723: is this coming from Lemma 3? Please add details.
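For the comment on l618, a minimal numerical check of the two ranks, assuming "1's on the upper diagonal" refers to the superdiagonal (the exact 3 x 3 matrix discussed in the paper may differ):

    import numpy as np

    # One reading of "the 3 x 3 matrix with 1's on the upper diagonal":
    # ones on the superdiagonal only (illustrative; the paper's matrix may differ).
    U = np.array([[0., 1., 0.],
                  [0., 0., 1.],
                  [0., 0., 0.]])
    S = U + U.T  # its symmetrized version

    print(np.linalg.matrix_rank(U))  # 2
    print(np.linalg.matrix_rank(S))  # 2

Under this reading both ranks come out as 2, consistent with the remark above.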
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
In this submission, the authors study polynomial networks and factorisation machines. The authors outline the existing direct approaches, which explicitly take outer products in multiple modes and therefore avoid the so-called kernel curse. These approaches can learn a corresponding factorisation in place of a higher-order decision matrix W. The authors describe both the polynomial and ANOVA kernels and propose the so-called lifted approach, which introduces an augmented dot product for the ANOVA kernel. The main contribution, however, is the multi-convex formulation, in which the authors modify the factorisation of W (in the second-order case) or of the higher-order tensor using a sum of rank-one tensors. The main outcome is that the proposed objectives are convex in each variable, so alternating minimisation is an appealing solver for this class of problems.

Strengths:
- interesting theoretical frameworks the authors work with
- interesting approach to overcome the 'direct approaches'

Weaknesses:
- perhaps somewhat unnecessarily complicated notation, while the main contributions are really hidden in sections 5.2 and 6
- the experimental section could perhaps be stronger

Clarity - Justification:
Reasonably clear paper, but the authors should simplify the notation and make a clear distinction between their contributions and prior work.

Significance - Justification:
Learning generalisations from higher-order statistics is a relevant problem.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Major comments:
1. The authors make a claim/suggestion somewhere in the submission that the proposed approach can scale gracefully to orders r > 3, e.g. 4th, 5th, etc. However, this is never evaluated. What complexity would this algorithm enjoy in the case of fifth-order statistics?
2. The discussion of the benefits of the ANOVA versus the polynomial kernel feels somewhat incomplete. In which cases would the ANOVA kernel clearly be a better performer than polynomial approaches, and why?
3. The experiments in Figure 2a seem convincing, but the direct approach seems to converge comparably fast. The curve is cut off at some point, but if continued, perhaps the convergence would be similar to that of the proposed approach. Can the authors provide any clear-cut support for alternation-based optimisation versus non-convex optimisation? After all, is alternating over convex subproblems guaranteed to converge to a better minimum than solving a non-convex problem, and if so, what are these guarantees exactly? If the reviewer is not mistaken, it is known that for locally smooth functions, both approaches should converge to a local minimum.
4. Also, is it possible to add the direct approach to Figure 2b and Figures 3a and 3b?

Overall, the proposed approach is relevant to learning generalisations from higher-order statistics. Therefore, the reviewer is reasonably positive about the proposed methodology, but ideally the experimental section should be strengthened to highlight the benefits of the proposed extensions.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents a framework that combines factorization machines and polynomial networks. Using polynomial and ANOVA kernels, the paper explains how factorization machines and polynomial networks are special cases of the proposed framework. The framework leads to an efficient optimization method based on tensor factorization. Interesting empirical results and analysis are included.

Clarity - Justification:
This paper is nicely written. The new framework is clearly presented, and the connections to factorization machines and polynomial networks are well explained. The organization is easy to follow.

Significance - Justification:
The paper presents an interesting framework based on a kernel-based interpretation of factorization machines. Factorization machines have recently been successful in a few domains, including recommender systems.
The kernel-based interpretation is nicely connected to a tensor-based view, leading to an efficient optimization method.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The new framework and optimization algorithms provide a simplified view of factorization machines. A nice alternative view. The symmetrization trick is effective since it makes the problem multi-convex. On the other hand, in order to keep the parameter size comparable, half (or a third, etc.) of the intended rank is used with the symmetrization trick. I wonder whether this trick limits the space of representable parameters. The question is: can every symmetric matrix of rank k be represented as the sum of a rank-k/2 matrix and its transpose? If the answer is no, what is its implication? (A small sketch of the k = 2 case is given below.)

=====
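For the final question, a small numerical sketch of the k = 2 case (the vectors below are random and purely illustrative, not taken from the paper). With a rank-1 factor B = a b^T, the symmetrized matrix B + B^T has nonzero eigenvalues a.b + |a||b| and a.b - |a||b|, so at most one of them is strictly positive:

    import numpy as np

    rng = np.random.default_rng(0)

    # For k = 2, the symmetrized parameterization with a rank-1 factor B = a b^T
    # gives B + B^T, whose two nonzero eigenvalues are a.b + |a||b| and
    # a.b - |a||b|; at most one of them can be strictly positive.
    for _ in range(5):
        a = rng.standard_normal(4)
        b = rng.standard_normal(4)
        eig = np.linalg.eigvalsh(np.outer(a, b) + np.outer(b, a))
        print(np.round(eig, 3))

In particular, a symmetric rank-2 matrix with two strictly positive eigenvalues does not appear to be expressible as B + B^T with rank(B) = 1, which would be one concrete instance of the concern about a restricted parameter space.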