Paper ID: 296
Title: Minding the Gaps for Block Frank-Wolfe Optimization of Structured SVMs

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes several improvements to the block-coordinate Frank-Wolfe algorithm in the context of structural SVMs. Specifically, the improvements include non-uniform sampling of blocks (based on the surrogate duality gap), pairwise and away steps, caching of linear oracle calls, and finding solutions for multiple values of the regularization parameter. Experimental results for several structured output prediction problems are shown. Overall, I think the paper is well presented and motivated, and it is interesting from a practical perspective. On the downside, the theoretical analysis does not cover all of the variants: pairwise and away steps are not analyzed, so it is not clear whether these provide any benefit in the block-coordinate case.

Clarity - Justification:
The paper is well written and the experiments are thorough. The proofs seem to be correct.

Significance - Justification:
I think the paper is interesting from a practical perspective. The nature of the contribution is a few improvements to the BCFW algorithm, so in that sense it is not groundbreaking. My main criticism is that the theoretical analysis is incomplete.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
From Theorem 2 it is not clear when non-uniform sampling of blocks gives a faster rate than uniform sampling. For example, the dependence on the number of blocks n is worse here. It would be good to characterize the setting in which the non-uniform sampling bound dominates the uniform one.

A convergence analysis for pairwise and away steps in the block-coordinate case (Algorithms D.2 and D.3) is missing. There is a very recent relevant paper that seems to have such an analysis: arXiv:1602.01543. In that sense I feel that the paper is trying to cover too much.

Line 70: this is a minor comment, but to be precise BCFW is a *randomized* algorithm, not a stochastic optimization method.

Typos:
459: modification -> modifications
461: method -> methods
480: solving -> solve
518: hood -> good
584: “if convergence”
609: algorithm -> algorithms
621: Than -> Then
641: if -> is
737: undefined reference
757: pairwises -> pairwise
810: controls -> control
864: lambda -> \lambda
1409: w.r.t.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper aims at improving Block-Coordinate Frank-Wolfe (BCFW) for structural SVMs in three different ways:
1. Non-uniform sampling, with weights estimated from the linearization duality gaps of previous iterations (I sketch my reading of this scheme right after this summary).
2. Adding pairwise and away steps, which are known to allow linear convergence in the batch FW setting.
3. Caching popular coordinates for the max oracle.
The paper separately proposes a method to approximately trace the entire regularization path for structural SVMs. The results are clearly presented, with the intuitions behind these tricks explained in detail. Extensive experiments are conducted to illustrate the performance of different combinations of the approaches on several real-life data sets. Compared to its experiments, the theoretical part of the paper is relatively weak. No convergence results are proven for the version with pairwise and away steps, let alone the expected linear convergence as in the batch setting.
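For concreteness, before turning to the theory, here is my own rough sketch of the gap-based sampling scheme in point 1; the function names and the handling of stale gaps are my illustration, not the authors' notation or code:

    import numpy as np

    # My own illustrative sketch (not the authors' code) of gap-weighted block
    # sampling in BCFW: each block keeps the duality gap observed on its last
    # visit, and blocks are sampled proportionally to these stale estimates.

    def sample_block(gap_estimates, rng, floor=1e-12):
        # Flooring keeps every block reachable even if its stored gap is ~0.
        weights = np.maximum(gap_estimates, floor)
        return rng.choice(len(weights), p=weights / weights.sum())

    def bcfw_gap_sampling(n_blocks, fw_step, num_iters, seed=0):
        # fw_step(i) performs one Frank-Wolfe step on block i (one max-oracle
        # call plus line search) and returns the block gap that the oracle
        # call yields as a byproduct.
        rng = np.random.default_rng(seed)
        gaps = np.ones(n_blocks)  # optimistic init: every block is visited early
        for _ in range(num_iters):
            i = sample_block(gaps, rng)
            gaps[i] = fw_step(i)  # update iterate; refresh only this block's gap
        return gaps

Note that each iteration refreshes only one block's estimate, so the sampling weights are necessarily stale; this is precisely the mismatch with Theorem 2's exact-gap assumption that I discuss below.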
While convergence results are provided for 1 and 3, the results stay on the rather generic end and do not actually explain why these tricks work satisfactorily (as explained in detail later). Nonetheless, I feel that given the compelling empirical results of these extensions to BCFW, the paper stands above the bar of acceptance.

Clarity - Justification:
Very clearly written.

Significance - Justification:
It is practically significant. The theoretical results only say ``these tricks don't break convergence''; they do not actually contribute to the understanding of why these variants converge faster. My "average" is calibrated against the ~7 papers I reviewed, a batch slightly stronger than in previous years.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
- Theorem 2 only applies to the case where the exact gap is known at every iteration, while in the algorithm these gaps need to be estimated from past iterations. Also, the convergence bound seems to be strictly worse than that for uniform sampling: it depends on n C_f^{max}, which is strictly larger than C_f^\otimes. Therefore, the benefit of non-uniform sampling cannot be seen from the bound. This is in contrast to the analogous non-uniform sampling results for SAG and coordinate descent.
- Theorem 3 likewise gives a bound that is a hundred times worse when the parameter \nu is set to the suggested value 0.01, and the analysis does not contain much in the way of novel ideas. The trick of caching for the same BCFW algorithm has been studied before (perhaps independently) in Shah et al. (2015).

Another minor point:
- It would perhaps be better if the main text of the paper were more self-contained, so that readers do not need to refer to the appendix frequently to find the definition of a quantity or the description of an algorithm. This is not meant as a critique, as the page limit is a hard constraint.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper has two mostly independent contributions. The first is a set of optimizations to the block-coordinate Frank-Wolfe (BCFW) optimization method for structured SVMs. These improvements exploit the fact that the convergence proof works by showing improvement on the block optimized in the current iteration and taking an expectation of this improvement across all blocks (i.e., the entire problem). The second contribution is an algorithm to compute approximate regularization paths from BCFW, with some analysis around the duality gap.

The main improvements are non-uniform sampling of training examples based on outdated estimates of the duality gap for those examples (which leads to more overall progress in each block step, provided the estimates are still current enough), pairwise/away steps (reoptimizing over all previously computed dual variables in each block instead of taking just the optimal descent step), and caching of oracle calls (which is very similar to the pairwise steps, but without making new oracle calls).

The regularization path computation builds on the duality gap provided by BCFW, which allows one to compute, given the current solutions to the max-oracle, the smallest regularization value that keeps the current model epsilon-optimal. One can then initialize the algorithm with the smallest regularization such that w=0 is an optimal model and iterate (a sketch of this loop follows below).

Clarity - Justification:
The paper is reasonably clear. All statements are justified in the text, and sketches of the most important proofs are provided in the main text.
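To make the path construction above concrete, here is a minimal sketch of the loop as I understood it; bcfw_solve and smallest_lambda_still_eps_optimal are hypothetical placeholders for the paper's subroutines, not its actual API:

    # My own pseudocode sketch (not the authors') of the epsilon-approximate
    # regularization path described in the summary above.

    def regularization_path(bcfw_solve, smallest_lambda_still_eps_optimal,
                            lam_init, lam_min):
        # bcfw_solve(lam, warm_start): run BCFW at regularization lam until
        #     eps-optimal, starting from warm_start (None means the w = 0 model).
        # smallest_lambda_still_eps_optimal(model, lam): use the duality gap and
        #     the cached max-oracle solutions to find the smallest lam' <= lam
        #     at which `model` is still eps-optimal.
        path = []
        lam = lam_init   # smallest lambda for which w = 0 is (eps-)optimal
        model = None
        while lam > lam_min:
            model = bcfw_solve(lam, warm_start=model)  # warm-started re-solve
            path.append((lam, model))
            # Duality-gap breakpoint: how far lambda can drop before the
            # current iterate leaves eps-optimality.
            lam = smallest_lambda_still_eps_optimal(model, lam)
        return path

If the warm-started re-solves are cheap, this can beat solving each lambda from scratch; whether it beats a simple grid search is the question I raise under Significance.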
Significance - Justification:
The one issue I have with this paper is that it is fairly specific and the results are not very compelling. The main improvements (caching, pairwise steps, non-uniform sampling) are very similar to previous versions in the literature (though the analysis is somewhat novel), and the regularization path algorithm does not seem substantially better than grid search.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
While the paper seems correct and interesting, I worry it might not be appealing enough to the wider community to justify inclusion in this conference. The improvements proposed in this paper seem to be implementation tricks, most of which are known to work in similar algorithms, and which are here proven to also work for block-coordinate Frank-Wolfe. It is not clear to me whether block-coordinate Frank-Wolfe is widely used enough for these tricks to be broadly interesting, and furthermore the paper does not provide enough evidence that, together with these tricks, block-coordinate Frank-Wolfe becomes substantially better than the state of the art for the kind of learning problems to which this method applies. The paper would benefit from a comparison with other learning methods for these problems, to better help the reader situate its significance.

=====