Paper ID: 1104  Title: A Box-Constrained Approach for Hard Permutation Problems

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

This paper tackles the quadratic assignment problem (QAP). Recently, Vogelstein et al. (2015) considered a (non-convex) relaxation of QAP that requires minimizing a quadratic function over doubly-stochastic matrices (represented by \Theta(n^2) variables). The authors reduce the number of variables to \Theta(n \log^2 n) using the idea of (Goemans 2009; Lim and Wright 2014) to represent permutations via a sorting network. More specifically, the authors represent a doubly-stochastic matrix as a product of \Theta(n \log^2 n) matrices, where each matrix corresponds to a potential swap of two fixed elements and is parameterized by a number in [0,1] (where 1 means "swap" and 0 means "do not swap"). The authors use a continuation approach and approximately solve a sequence of subproblems (starting from a convex one and gradually moving to a concave problem, for which the minimum is guaranteed to be integer-valued). For each subproblem, coordinate descent is employed. The authors compare this approach to Vogelstein et al.

Clarity - Justification: Well-written.

Significance - Justification:

+ The representation of permutations via a sorting network sounds like an interesting idea worth exploring.

- The results are not fully conclusive. In terms of accuracy they are very mixed; on QAPLIB instances the proposed method is better than Vogelstein et al. on approximately half of the problems. The runtime is typically a few times faster (e.g. 5 times), which is good, but not a breathtaking improvement given the mixed accuracy performance.

- It would be good to see a more detailed experimental evaluation. There are many differences from Vogelstein et al., and it is hard to understand the effects of the individual components.
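To make the construction concrete for readers unfamiliar with it, here is a small sketch of the swap-matrix representation described above. This is my own illustration, not the authors' code; for simplicity it uses an odd-even transposition network with \Theta(n^2) comparators, whereas the paper relies on a \Theta(n \log^2 n)-comparator network, and all function names are mine.

```python
import numpy as np

def soft_swap(n, i, j, x):
    """Elementary doubly-stochastic factor: x = 0 keeps coordinates i and j,
    x = 1 swaps them, and x in (0, 1) mixes them."""
    E = np.eye(n)
    E[i, i] = E[j, j] = 1.0 - x
    E[i, j] = E[j, i] = x
    return E

def odd_even_comparators(n):
    """Comparator list of an odd-even transposition sorting network
    (Theta(n^2) comparators; the paper uses a Theta(n log^2 n) network)."""
    return [(i, i + 1)
            for r in range(n)
            for i in range(r % 2, n - 1, 2)]

def network_to_matrix(n, comparators, xs):
    """Map parameters xs in [0, 1]^m to a doubly-stochastic n x n matrix
    by multiplying one soft-swap factor per comparator."""
    M = np.eye(n)
    for (i, j), x in zip(comparators, xs):
        M = soft_swap(n, i, j, x) @ M
    return M
```

With xs restricted to {0,1}^m the product is a permutation matrix; with xs in [0,1]^m it is doubly stochastic (a product of doubly-stochastic matrices is doubly stochastic), which is what allows a box-constrained parameterization with far fewer variables than the Birkhoff polytope.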
My tentative ranking is "weak accept"; however, I would like to see whether the authors have any further comments on the experimental evaluation before making the final recommendation.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

I noticed the following differences from Vogelstein et al.:

- The submission uses a continuation approach, while Vogelstein et al. do not (if I understood their paper correctly).

- The authors use an additional postprocessing step described in lines 660-674. Presumably, a similar trick could have been used in the Vogelstein approach.

It is thus natural to ask whether the difference in performance could be attributed to these factors. Alternatively, if Vogelstein's implementation is augmented with these features, would it give better results?

Note that although the authors' representation indeed uses fewer variables (\Theta(n \log^2 n) instead of \Theta(n^2)), a priori it is not clear that the overall method would be better in terms of speed and/or accuracy:

- The "regularization" term added by the authors may have a very complicated effect (e.g. bias the solution towards certain permutations that are independent of the input data). In contrast, in Vogelstein's approach it seems possible to add a "regularization" that a priori would not have a bias towards any specific permutation.

- The objective function in the Vogelstein approach is quadratic in the input variables, while it is much more complicated with the proposed representation (it is multilinear in each variable, and then squared). Perhaps this is why the authors resort to a coordinate descent strategy, while Vogelstein et al. use a Frank-Wolfe method (which presumably may converge much faster). Unfortunately, it is not possible to infer the speed of convergence from the results presented in the paper.

The results on the "bur" and "els" problems are drastically different from the other results in terms of running time.
It would be good to understand where this difference comes from. Is it because of the size of the problems, because of the speed of convergence of the inner loops, or something else? In general, it would be useful to specify the sizes of the problems in Table 1.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

This paper introduces a novel method for learning permutations, focusing on the Quadratic Assignment Problem (QAP), which is NP-hard. The method relies on a representation of permutation matrices as products of simple "swap" matrices, and uses a continuous relaxation of each "swap" matrix into a simple 2x2 bistochastic matrix. The paper uses this relaxation instead of the better-known Birkhoff polytope. The authors suggest using this relaxation along with a continuation method for a regularization term: the regularization term initially makes the problem convex, but by decreasing the term and then inverting its sign, the same term eventually encourages the bistochastic matrices to resemble permutation matrices. Bringing these ideas together, the authors show that the objective can be optimized using simple coordinate minimization with box constraints. Experiments on QAP benchmarks and comparisons to the state of the art show that the proposed method is competitive in terms of performance, and in most cases superior in terms of runtime.

Clarity - Justification: The paper introduces the basic premise and idea clearly. However, some important technical aspects are written in a somewhat confusing manner. In addition, some notational inconsistencies make the reading harder. See the detailed comments below.

Significance - Justification: The paper offers a novel approach to a hard problem, learning permutations, for which there has been recent interest (e.g. the authors cite Fogel et al. (2014) and Lim & Wright (2014, 2016)).
The representation of a sorting network in terms of products of simple matrices, introduced in this paper, is novel at least in this context. It will certainly be useful for further applications, and is reminiscent of the decomposition of a rotation matrix into a product of Givens rotations.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

1. In the proof of Lemma 2.1 there is a reference to the s and t rows of V; s and t should be defined.

2. In the proof of Proposition 3.1, what is meant by "f(x) is monotonically decreasing"? Which f? Is f a function of x or of \phi(x)? And why is it monotonically decreasing?

3. Equation (14) and the paragraphs following it: if I understand correctly, what the authors mean by a "multiquadratic objective function" is that while the objective itself is NOT quadratic, the exponent of each "pure" variable is 2. This means that, per coordinate, the objective is a quadratic function which can be minimized exactly under a box constraint. I would have understood this section better had the authors explained this explicitly. Moreover, the gradient and second derivative are just means to obtain the coefficients of the quadratic objective w.r.t. a single variable x_k. I found this aspect, along with the use of the term "coordinate descent", confusing. In fact, what the authors suggest is coordinate *minimization*: since the objective function, restricted to one coordinate, is quadratic, exact minimization is possible. The coefficients of this quadratic are given via eqs. 15-17, but these are *not* used as descent directions. Rather, since the first and second derivatives of a quadratic determine its coefficients, obtaining the derivatives is a way to uncover the quadratic coefficients. I think explaining these points would make the paper easier to follow for some readers.

4. Lines 566-570: What are Y^{k+1} and X^{k+1}? There seems to be a confusion between S, T and X, Y.

5.
I would be glad if the authors could discuss how this method could be used for other objective functions beyond QAP. Much of the computational efficiency of the method, as presented, seems to follow from the fact that the specific objective of QAP can be decomposed in a way that allows efficient computation and cyclic updates of the variable coefficients. For other objectives, I doubt that this will always be the case, and I am curious to know what other efficient optimization approaches are possible.

Minor comments:

- The paper should cite the journal version of Goemans's paper: Goemans, Michel X. "Smallest compact formulation for the permutahedron." Mathematical Programming 153.1 (2015): 5-11.
- Line 158: I believe the log(n) term should be log^2(n).
- Lines 185-186: There seems to be a typo or a missing word here.
- Line 258: typo "the how".
- Lines 863-864: I suggest replacing m by n log n or n log^2 n.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

This paper studies permutation problems, i.e. optimization problems over permutations. The main contribution of the paper is a new approximate algorithm for permutation problems. The algorithm is based on the sorting-network relaxation proposed by Lim & Wright (2014). The inner loop of the algorithm consists of coordinate descent w.r.t. the variables coming from the sorting network polytope. The outer loop of the algorithm updates the regularization parameter, starting from the regime where the problem is convex (or almost convex) and ending with a negative regularization parameter, leading to a concave function that ensures the variables are discrete. The proposed method is evaluated on the quadratic assignment problem (QAP) and compared against the method of Vogelstein et al. (2015). The proposed method is better on some instances and worse on others, but is often significantly faster.

Clarity - Justification: The paper is well-written and the explanation of the material is clear.
Some minor comments:

a) Lines 184-187: incomplete sentences.

b) Line 302 mentions doubly-stochastic matrices for the first time. It would be helpful to remind the reader what they are, e.g. around equation (3).

c) Lines 660-674: it is not clear which statement is being made: "The method obtains a permutation whose objective value is smaller than the maximum value among the permutations within one swap of the global minimizer" or "The objective value cannot be improved by a single swap, i.e. the result is a local minimizer." Line 661 suggests the first interpretation, but the further discussion suggests the second one.

d) Line 690: long line.

Significance - Justification: The proposed method is interesting and, to my knowledge, novel. Lim & Wright (2014), when proposing the sorting-network relaxation, used the general-purpose Gurobi solver. The proposed algorithm obtains solutions of quality comparable to Vogelstein et al. (2015) but often works significantly faster. Such properties suggest that the method might be very useful in practice for larger problems.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

I am a bit confused about what exactly is novel. It looks like the relaxation itself was proposed by Lim & Wright (2014) and the paper proposes a new optimization technique. However, line 852 claims that the relaxation itself is novel. I suggest either explicitly saying that the relaxation is not novel or discussing the differences from Lim & Wright in more detail.
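To make the inner/outer loop structure discussed in the reviews concrete, here is a schematic toy sketch of continuation with exact coordinate minimization. This is my own illustration, not the paper's implementation: the least-squares objective and the x(x-1) regularizer are assumptions chosen so that the objective restricted to one coordinate is a box-constrained quadratic, and mu moves from positive (convexifying) to negative (concave, pushing each coordinate to {0,1}).

```python
import numpy as np

def coord_min_quadratic(a, b):
    """Exact minimizer over [0, 1] of q(t) = a*t^2 + b*t."""
    cands = [0.0, 1.0]
    if a > 0:                          # convex: clipped stationary point
        cands.append(min(1.0, max(0.0, -b / (2.0 * a))))
    # concave or linear case: the minimum is at an endpoint, already included
    return min(cands, key=lambda t: a * t * t + b * t)

def continuation_solve(A, b, mus, sweeps=50):
    """Minimize ||A x - b||^2 + mu * sum_k x_k (x_k - 1) over [0, 1]^m
    by cyclic exact coordinate minimization, for a decreasing schedule
    of mu values (positive -> negative)."""
    m = A.shape[1]
    x = np.full(m, 0.5)
    for mu in mus:
        for _ in range(sweeps):
            for k in range(m):
                r = A @ x - b - A[:, k] * x[k]   # residual with x_k removed
                a_k = A[:, k] @ A[:, k] + mu     # quadratic coeff in x_k
                b_k = 2.0 * A[:, k] @ r - mu     # linear coeff in x_k
                x[k] = coord_min_quadratic(a_k, b_k)
    return x
```

Each coordinate update is an exact minimization (not a descent step along a direction), which is the distinction Review #2's point 3 asks the authors to spell out; once mu is sufficiently negative, every one-dimensional restriction is concave and the iterates land on {0,1}.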
A cyclic coordinate descent algorithm is proposed, which is demonstrated to be empirically faster than, and competitive with, other recent QAP solvers on most instances from QAPLIB and on synthetic problems.

Clarity - Justification: The overall clarity of the paper is good. The introduction is well written; a number of typos (listed below) appear in the following sections. The organization is quite clear. The results seem reproducible, notably due to the simplicity of the algorithm, which does not use too many heuristics; this is an advantage over many other combinatorial solvers.

Significance - Justification: The use of sorting networks to relax permutation problems is not new (cf. Lim and Wright 2014, cited in the paper), but combining it with a coordinate descent algorithm is a nice contribution and seems to give good empirical results. It would be interesting for the broader machine learning community to see how this algorithm performs on a larger-scale real-world example, rather than on synthetic examples, which, in my opinion, do not provide more insight than the QAPLIB examples. Some important questions remain unexplored and undiscussed (cf. below).

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

- While the cycle complexity is O(nm), it seems that the overall complexity of the algorithm could be exponential, since "the algorithm terminates within n! + m coordinate iterations" (Prop. 3.1). Is it possible to provide better bounds on the number of cycles to perform? This has to be taken into account when comparing to the complexity of the Birkhoff relaxation.

- Could the authors discuss stability issues and guarantees on the retrieved solutions?

- Scalability should be discussed, since it seems to be one of the major benefits of the proposed method.

- As already mentioned, it would be nice to have a real-life example, in particular one of high dimension for which other solvers do not scale.
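Several reviews touch on the swap-based refinement of lines 660-674 and on local-optimality guarantees for the retrieved solutions. For reference, the textbook greedy 2-swap local search for the QAP objective looks as follows; this is a generic sketch, not necessarily the authors' exact postprocessing, and the function names are mine. It guarantees only the second, weaker property debated in Review #3's comment (c): no single transposition improves the objective.

```python
import numpy as np
from itertools import combinations

def qap_cost(A, B, p):
    """QAP objective sum_{i,j} A[i, j] * B[p[i], p[j]] for a permutation p
    given as an index array."""
    return float((A * B[np.ix_(p, p)]).sum())

def greedy_two_swap(A, B, p):
    """Repeatedly apply the best improving transposition until the
    permutation cannot be improved by any single swap (a 2-swap local
    minimizer). Each candidate swap recomputes the full objective in
    O(n^2); an O(n) delta update is possible but omitted for clarity."""
    p = np.array(p)
    improved = True
    while improved:
        improved = False
        best, cost = None, qap_cost(A, B, p)
        for i, j in combinations(range(len(p)), 2):
            q = p.copy()
            q[i], q[j] = q[j], q[i]
            c = qap_cost(A, B, q)
            if c < cost:
                best, cost = q, c
        if best is not None:
            p, improved = best, True
    return p
```

Used sequentially from a rounded relaxation solution, this is exactly the "naive greedy algorithm" Review #4 asks about; it terminates because the objective strictly decreases at each accepted swap and there are finitely many permutations.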
Typos:

- Line 115: directly explain that the regularization term is negative?
- Lines 135-136: rephrase, confusing.
- Line 146: relaxations -> plural.
- Line 159: in complexity O(nm).
- Lines 185-186: rephrase, missing words.
- Line 193: rephrase, missing verb.
- Line 258: drop "the" after "defines".
- Line 276: drop one "only".
- Line 305: do you mean using a recursion?
- Line 358: "conincide" -> "coincide".
- Prop. 2.2: the proof could be better explained (why we always get a permutation when solving (8), and why the solutions of both problems then coincide).
- Line 636: "additon" -> "addition".
- Line 660 and below: interesting refinement method; could this be used sequentially as a naïve greedy algorithm to solve the initial problem?
- Line 759: rephrase.
- Line 795: rephrase.
- Line 846: missing "is".

Summary of review: This article presents a new algorithm to solve permutation problems such as QAP. Empirical results seem to show that it is fast and efficient, but stability and scalability should be discussed.

=====