Paper ID: 731
Title: Parallel and Distributed Block-Coordinate Frank-Wolfe Algorithms

===== Review #1 =====
Summary of the paper (Summarize the main claims/contributions of the paper.):
== Meta-reviewer here; ignore my ratings, which are not meant to mean anything -- just have a look at my detailed comment to address. ==

Clarity - Justification: NA

Significance - Justification: NA

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
== Fundamental flaw in the convergence analysis ==

The authors should address the following in their rebuttal: it seems to me that there is a fundamental flaw in the current analysis approach. The high-level problem is that the distribution over possible mini-batch sets S in line 232 is *not* uniform in the distributed setting, because of the interaction with the server's queue when different blocks have different oracle times; this creates a bias in the analysis that needs to be addressed.

More specifically, for the simplest counter-example, consider tau = 1 (i.e. BCFW if it were sequential), n = 2, and 2 worker nodes with 1 server node. Say the oracle on block 1 is a billion times slower than the oracle on block 2. Then the buffer on the server is more likely to hold updates from block 2 than from block 1, thus creating a bias. Concretely, consider the first update processed by the server. There are four possibilities for the samples obtained by the two worker nodes: (1,1), (1,2), (2,1), (2,2). If block 2 is chosen by either worker, its update is the one the server processes first, since an oracle call on block 2 always finishes before one on block 1. So with (overall) probability 3/4, the iterate x^(1) will be obtained from an oracle call on block 2, and only with probability 1/4 will x^(1) be obtained from block 1.
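For concreteness, the 3/4 vs. 1/4 bias in this counter-example can be checked with a quick simulation (a hypothetical sketch of my own, not code from the paper; it only assumes each worker draws its block uniformly and that block 2's oracle always finishes first):

```python
import random

def first_update_from_slow_block(workers=2, trials=200_000, seed=0):
    """Sketch of the tau = 1, n = 2 counter-example: each worker samples a
    block uniformly from {1, 2}; since the oracle on block 1 is far slower,
    the server's first processed update comes from block 1 only if *every*
    worker happened to draw block 1."""
    rng = random.Random(seed)
    hits = sum(
        all(rng.choice((1, 2)) == 1 for _ in range(workers))
        for _ in range(trials)
    )
    return hits / trials

# Empirical frequency is ~1/4, not the 1/2 a uniform-sampling analysis assumes.
print(first_update_from_slow_block())
```

With more workers the bias only worsens: the slow block's update is processed first with probability (1/2)^workers.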
This means that equation (13) on line 1148 in the Appendix is wrong: it is not the case that E_{S | x^(0)} \sum_{i in S} g^(i)(x^(0)) = (1/2) g(x^(0)) here; instead we have E_S (...) = (1/4) g^(1) + (3/4) g^(2), which is different [the bias I mentioned above]. Note also that there was no delay in the parameter used for the oracle call here -- it was the correct x^(0) [i.e. <s_S, grad ...> - min_s <s, grad ...> = 0, so this was not the problematic term]. The authors should crucially address this concern!

For similar reasons, I think the use of Lemma 5 on line 1566 is a bit far-fetched -- I do not think that E_{S_{-chi_j, ..., -1} | chi_j} Y (in equation (18)) can really be interpreted as just throwing m balls *independently and uniformly* at random into n bins: there is a dependence on which block we are considering, as well as on the speed distribution of the other blocks, so I do not think you have uniformity. (And anyway, what is the precise meaning of E_{S_{-chi_j, ..., -1} | chi_j}? It was not defined.)

I also think that the distribution of delays for the parameter vector tilde{x} used for a block j (the random variable chi_j, e.g. appearing on line 1444) *depends on j* (and most probably also on k, given the description of the algorithm [truncation for delays > k/2]). So it seems unclear to me what the kappa on line 583 in Theorem 4 is referring to. Perhaps it should instead be max_{k,j} E[chi_{j,k}].

===== Review #2 =====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper develops parallel and distributed variants of the Frank-Wolfe algorithm, with block-coordinate updates carried out on different compute cores or machines. Convergence of the algorithm is established under moderate assumptions on delayed updates in an asynchronous environment. Numerical experiments are presented on structural SVM and group fused Lasso, demonstrating improvement over synchronous algorithms.
Clarity - Justification: Presentation is very easy to follow.

Significance - Justification: The proposed AP-BCFW algorithm extends the work of Lacoste-Julien et al. (2013) to parallel computation with mini-batches. The main contribution is the capability of running the algorithm in an asynchronous distributed manner, with theoretical support in terms of the amount of average delay that can be tolerated. The explanation of the n^2 scaling of the constant in Theorem 1 is crucial for understanding the main theoretical implications.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
For block-coordinate algorithms, the efficiency of computing the partial gradient is of crucial importance in the overall complexity estimate. For the proposed AP-BCFW algorithm, it would be good to elaborate on this point using more concrete examples, such as linear learning problems with either sample or feature splitting across distributed machines, and to state the overall computation and communication costs.

There is not enough comparison with other distributed optimization algorithms, both in terms of overall complexity and in the numerical experiments -- for example, against a naive implementation of a parallel full-gradient method or ADMM, which would give a more concrete positioning of the real performance of the algorithm. (I am not asking for comparison with the more exotic methods in the literature.) Also, the size and dimension of the datasets used are too small; at that scale, using distributed asynchronous algorithms seems pointless.

===== Review #3 =====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents parallel variants of the block-coordinate Frank-Wolfe (BCFW) algorithm. Parallelization is studied both within each minibatch and in a lock-free setting (delayed updates).
In the minibatch case, the speedup shown in the analysis depends on the degree to which different coordinate blocks interact with each other (this appears as the minibatch curvature constant), which is similar to the situation in parallel coordinate descent analyses. In the lock-free asynchronous case, the speedup is almost linear even when the delay is as large as the number of blocks; this is better than the HOGWILD! and coordinate descent analyses. The derivations are sound as far as I checked. I generally like the paper and think it provides a decent treatment of the problem of parallelizing BCFW.

Clarity - Justification: I think the presentation is a bit messy and can be improved. For example, the division between what appears in the main text vs. the supplementary material could be reconsidered: the shared-memory algorithms appear in the supplementary, yet the experiments are all conducted in that setting. Another issue is that the results are split across multiple equations, which makes them hard to understand. For example, in order to get the rate for the delayed-updates case, one needs to plug C into the equation of Theorem 1, then plug in delta from Theorem 4, and then substitute c from Equation 12. I suggest you write down the final rates for all cases in a table (the table in D.4 shows only one case).

Significance - Justification: I generally like the paper and think it provides a decent treatment of the problem of parallelizing BCFW.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I think the name AP-BCFW is not directly related to the algorithms (line 118)?

About the sentence (line 690): "This is because large mini-batch sizes introduce errors, which reduces progress per update". It would be interesting to also test the effect of the minibatch size without delays, to see whether the delays really matter or whether this is simply the tradeoff between single-block, minibatch, and batch FW.
Otherwise it is hard to understand Figure 1.

Minors/Typos:
- Line 253, fo => for
- Line 321, low => allow
- Line 322, S_(i)
- Line 341, there is no line-search variant or Steps 2b and 5 in Algorithm 1
- Line 567, remains => remain
- C in line 499 overloads the notation (previously defined in line 345); better to use a different symbol

===== Review #4 =====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes a parallel variant of the randomized block Frank-Wolfe algorithm. The method can also be seen as a distributed method (depending on whether the workers are seen as machines in a distributed system or processors in a shared-memory system). A rigorous complexity analysis of this method is performed, with insights into how the interdependencies in the data defining the problem influence parallelization speedup. An asynchronous variant of the method is also analyzed, based on viewing it as an inexact version of the master method. Complexity bounds are obtained under the assumption that the expected delay is finite; better bounds are obtained under a bounded-delay assumption. Structured SVM and group fused LASSO are used as examples of problems to which the proposed algorithm can be applied, and convincing experiments are carried out.

Clarity - Justification: The paper is well written. The ideas are presented in a natural order, and the main results are carefully commented and elaborated upon. The paper contains a small number of typographical and exposition issues (all of which can be addressed), some of which are listed below:
- I suggest you mention that M is a subset of \R^m when you define problem (1). Currently you do this at the end of Section 1, in the notation section.
- When you define m_i in (2), write explicitly that m_1 + … + m_n = m.
- Explicitly stress that the blocks x_{(i)} are non-overlapping, forming a partition of \R^m (or that the coordinate blocks form a partition of {1, …, m}).
- L106: expensive hampering -> expensive, hampering
- L167: descents -> descent
- L177: dependency -> dependence
- L217: broadcast -> broadcasts
- L282: in the -> on the
- L321: low -> allow
- L322: s_{(}i) -> s_{(i)}
- L322: Reformulate the sentence starting with "The approximation". Currently it is confusing and grammatically does not make sense.
- L327: f^{k} was not defined. I presume this means f(x^{(k)}). Define this.
- L327: End inequality (6) with a comma.
- L330: The word "coins" seems a bit out of place, even though I know what you are trying to say. Use a less colloquial and more precise wording.
- Theorem 1: where the constant C = -> where C = (or, alternatively, say: where the constant C is defined as …)
- Theorem 2: split the sentence into two. Currently it sounds awkward.
- Theorem 3: You did not define the terms B-expected boundedness and mu-expected incoherence. It is clear what you mean (since it is possible to guess), but these terms should be defined before they are used.
- Theorem 3: incoherence. Then, -> incoherence, then
- L558: are -> were
- L567: remains -> remain
- L614: dependency in -> dependence on
- The bibliography contains a number of issues:
  - Always capitalize "Frank-Wolfe". You almost exclusively write "frank-wolfe".
  - Paper of Bach: you do not mention arXiv/journal/tech report.
  - Paper of Bleakley: arXiv ID missing.
  - Richtarik's name is misspelled.
  - Paper of Frank and Wolfe: capitalize the journal title, as in all other journal titles.
  - Paper of Niu et al. has appeared.
  - Paper of Richtarik and Takac has appeared.
  - Paper of Tsitsiklis: capitalize the journal title.
More serious / less minor issues will be mentioned below.

Significance - Justification: The paper is a very nice addition to the literature on the FW algorithm.
To the best of my knowledge, parallel variants were not analyzed before (not in this generality). FW is an important algorithm used in practice, and this development was needed. I think this paper will be influential.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
How do your results compare to those in:
- Nesterov. Complexity bounds for primal-dual methods minimizing the model of objective function. CORE Discussion Paper 2015/3, 2015.
- G. Lan. "The Complexity of Large-scale Convex Programming under a Linear Optimization Oracle". Technical report, Department of Industrial and Systems Engineering, University of Florida, June 2, 2013 (updated June 2014); submitted to Journal of Machine Learning Research, under revision.
A comparison is needed so as to ensure that these papers do not overlap with, or indeed contain, your results. My evaluation is based on the assumption that there is no or little overlap, but you need to check. I will need to check before a final decision is made.

Here I include some additional comments and remarks:
- L323: Define delta. This will explain what you mean by "additive", and will avoid the current confusion of reading the delta in (6) in a multiplicative way.
- L332: The sentence "and any other source of uncertainty…" is unexpected at this point, since not much has been said about asynchronicity and how it appears in the method; hence it is quite unclear how this enters inequality (6). I think it is best to interpret (6) as taking the expectation with respect to the randomness generated by the random choice of minibatches in all discussion prior to Theorem 1. Subsequently, you can add the additional interpretation in Section 2.3. This would be a less confusing and cleaner way of doing it.
- Theorem 1: What are Steps 2b and 5 in Algorithm 1? I can't find any such steps!
- Theorem 2: The line-search variant is mentioned here as well. Where can this be found?
Not in Algorithm 1…
- L615: Give a more detailed proof of the inequality. You say it follows from Jensen's inequality, but this does not seem immediately obvious. Is it correct? Perhaps add a lemma and a proof in the supplementary material.
- FW has been used with success for the sparse PCA problem. This is a key ML problem, and hence some references would make the paper more useful to the ML community.
- Ahipasaoglu and Todd have a sequence of papers on FW from just before the current wave of activity. I suggest that some references to these "early" works be added, for a more complete bibliography.
- Assumption (8) appears in the modern analysis of randomized parallel coordinate descent:
  [x1] Bradley, Kyrola, Bickson, Guestrin. Parallel Coordinate Descent for L1-Regularized Loss Minimization. In ICML 2011.
  [x2] Richtarik, Takac. Distributed Coordinate Descent Method for Learning with Big Data. JMLR, 2015.
- You compare against parallel coordinate descent under a different assumption (partial separability) in Section D.4. This makes the results for CD look worse than they are, since newer results improve on the previous results of Richtarik and Takac. For the most modern treatment of ESO inequalities for parallel coordinate descent under assumption (8), see:
  [x3] Qu, Richtarik. Coordinate Descent with Arbitrary Sampling II: Expected Separable Overapproximation. arXiv:1412.8063, 2014.
=====