Paper ID: 675
Title: Gossip Dual Averaging for Decentralized Optimization of Pairwise Functions

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents both synchronous and asynchronous algorithms for decentralized optimization of pairwise functions. The algorithms are based on decentralized dual averaging and come with theoretical guarantees. Experiments are presented for AUC maximization and metric learning, but could be improved.

Clarity - Justification:
The paper is well written and easy to follow.

Significance - Justification:
This paper is in an important and growing area, and tackles a difficult problem.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
In general I liked the paper. The problem setting is of practical and theoretical significance. The algorithm is interesting and novel, and the proof of convergence is non-trivial. My main concern is with the experiments, which were underwhelming:

- There seems to be some overlap with the work of Colin et al., particularly as they also perform synchronous and asynchronous decentralized optimization for pairwise problems such as AUC maximization. The experiments should include comparisons with, or at least more discussion of, this existing algorithm; it is not clear how the two compare.
- The metric learning experiments were performed only on simple synthetic data. A more realistic experimental setup would consider real data sets. Can the work of Colin et al. be applied in this setting?
- The AUC maximization experiments use only a single real data set.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper considers minimization of a function that is a sum of terms, each of which involves a pair of data points; there is one term for each pair, i.e., n-choose-2 in total. The paper proposes an algorithm that is, essentially, a combination of gradient descent and gossip-style exchange of information among "nodes", where each data point is a "node".

Clarity - Justification:
This paper would have been much clearer, and the results easier to appreciate and evaluate, if the terms in Proposition 1 and Theorem 1 (it seems these should be Theorems 1 and 2, based on the text?) had a clearer explanation, perhaps via simple special cases of graphs and functions. For example, a simpler result could fix a particular step size rule instead of a general \gamma(t), or consider a particular graph.

Significance - Justification:
The intended setting is a bit mystifying: why would one employ these kinds of gossip-style algorithms in settings where the data fits in memory, or even in the memory of several machines? There is no situation where each data point resides on only one node and all communications between such nodes are equally expensive.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper has some nice analysis, but it lacks a motivating example of why such a setting and algorithm would be needed in the first place (over competing paradigms such as distributed stochastic gradient descent). The theorems need to be made clearer, possibly at the expense of some generality, by focusing on the settings where the authors feel their method would perform best.
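For concreteness, here is the generic shape of a "gossip plus local gradient" round for a pairwise objective, i.e., the paradigm described in the summary above. This is a simplified sketch under my own choices (a hinge-type pairwise surrogate and a single randomly activated edge per round), not the authors' dual averaging update; every function and variable name in it is illustrative.

import numpy as np

def pairwise_grad(theta, xi, xj):
    # Subgradient of the pairwise hinge term max(0, 1 - theta.(xi - xj)).
    diff = xi - xj
    return -diff if (1.0 - theta @ diff) > 0 else np.zeros_like(theta)

def gossip_step(thetas, held, edges, X, step, rng):
    # One round: a random edge swaps its auxiliary data indices and averages
    # its iterates; then every node takes a local subgradient step on the pair
    # (its own point, the auxiliary point it currently holds).
    i, j = edges[rng.integers(len(edges))]
    held[i], held[j] = held[j], held[i]                      # gossip the data
    thetas[i] = thetas[j] = 0.5 * (thetas[i] + thetas[j])    # average iterates
    for k in range(len(thetas)):
        thetas[k] = thetas[k] - step * pairwise_grad(thetas[k], X[k], X[held[k]])
    return thetas, held

Initializing held to a permutation of the node indices and calling gossip_step repeatedly gives the flavor of the scheme; the open question, to my mind, is under which memory and communication constraints this is preferable to simply running distributed stochastic gradient descent over the pairs.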
Overall, this is a decent theory contribution that may have implications (although these are not detailed) for how data should be distributed and communication coordinated for some ML optimization problems.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper describes a distributed algorithm for the minimization problem \min_\theta \sum_{i,j} f(\theta, x_i, x_j), where f is convex and each node has constant memory. The paper relies heavily on the work of Duchi et al., who described a similar algorithm for the case where each term in the objective depends on only a single variable x_i, and on the work of Colin et al., who described gossip algorithms for estimating statistics that depend on pairs of observations.

Clarity - Justification:
The paper is exceptionally well written. There is one aspect, however, that I did not understand: not being very knowledgeable about gossip protocols, I did not see the significance of swapping exactly one pair of auxiliary variables per (outer) iteration. Would the algorithm not converge faster if more than one pair were swapped? Or is only one pair swapped because that is easier to analyze? Also, would the entire scheme fail for an objective that depends on only a strict subset of all pairs of examples? In that case, after each swap, a node may be holding two variables that do not appear together in the objective. A discussion that addressed these questions would be most welcome.

Significance - Justification:
This paper extends the work of Duchi et al. in two significant directions: (1) it allows each term in the objective to depend on a pair of examples rather than a single one, and (2) it lifts the requirement that the updates be synchronous across nodes.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Minor comments:
- Please define the graph Laplacian (a standard definition is sketched after these comments, for reference).
- There is a Proposition 1 and a Theorem 1; I think you want a Theorem 1 and a Theorem 2.
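For reference, and assuming the usual unnormalized convention (the paper may intend a normalized variant), the graph Laplacian of an undirected graph G = (V, E) on n nodes with adjacency matrix A is

L = D - A, \qquad D = \mathrm{diag}(d_1, \dots, d_n), \qquad d_i = \sum_{j=1}^{n} A_{ij},

where A_{ij} = 1 if (i, j) \in E and A_{ij} = 0 otherwise.

=====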