We thank the reviewers for their careful reading and helpful comments. We are glad that all of them recommended acceptance. We will take their remarks into account to improve the quality and clarity of the manuscript. Below, we answer the questions of each reviewer.

** Reviewer 10 **

* Swapping strategy

In the asynchronous setting, a single edge of the network must be active at each time step (see lines 520-526), hence the single swap. In the synchronous setting, one could swap more than one pair at each iteration and indeed improve the constants in the convergence rate. We did not perform this analysis because the synchronous setting mainly serves as a simplified framework for analyzing the asynchronous setting: it yields an easier-to-interpret convergence rate but may not be very practical, due to the global-clock hypothesis. We will add a remark to clarify this point.

* Strict subset of pairs

It is usually possible to exploit the structure of the strict subset of pairs involved in the objective: for instance, one could attach some concise additional information to each observation so that a node can easily identify whether a pair contributes to the objective and, if not, set its loss to zero. This is essentially the case in AUC optimization, where pairs of similarly labeled observations do not contribute. If the subset of pairs cannot be described in such a compact form, then one would have to provide each node with an index list of active pairs, which could be memory-intensive. We will add a discussion of this question.

** Reviewer 11 **

* Difference with Colin et al.

The goal of the paper is to provide a method to efficiently optimize pairwise functions of data in a setting where merely computing such a function is already a challenge. Indeed, Colin et al. (2015) proposed an algorithm to compute the function (or its gradient) for a fixed parameter, but it cannot be used to optimize the function.
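To make the distinction between computing and optimizing concrete, here is a minimal centralized sketch (illustrative names and loss only, not the paper's or Colin et al.'s estimator): evaluating a convex pairwise AUC surrogate at a fixed parameter is a single pass over cross-label pairs, while optimization iterates gradient steps on that parameter.

```python
import numpy as np

def pairwise_logistic_surrogate(theta, X, y):
    """Convex pairwise surrogate for 1 - AUC: average logistic loss over
    all (positive, negative) pairs. Illustrative sketch only."""
    pos, neg = X[y == 1], X[y == 0]
    diffs = pos @ theta - (neg @ theta)[:, None]  # s(x+) - s(x-), one per pair
    return np.mean(np.logaddexp(0.0, -diffs))     # mean log(1 + exp(-diff))

def surrogate_gradient(theta, X, y):
    """Gradient of the surrogate above with respect to theta."""
    pos, neg = X[y == 1], X[y == 0]
    diffs = pos @ theta - (neg @ theta)[:, None]
    w = 0.5 * (1.0 - np.tanh(diffs / 2.0))        # sigmoid(-diffs), overflow-safe
    grad_pairs = w[..., None] * (neg[:, None, :] - pos[None, :, :])
    return grad_pairs.reshape(-1, X.shape[1]).mean(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = (X[:, 0] > 0).astype(int)       # labels rankable by the first feature

theta = np.zeros(3)
loss_fixed = pairwise_logistic_surrogate(theta, X, y)  # computing: one evaluation
for _ in range(200):                                   # optimizing: iterate on theta
    theta -= 0.5 * surrogate_gradient(theta, X, y)
loss_opt = pairwise_logistic_surrogate(theta, X, y)
```

The first call corresponds to evaluating the objective at a fixed \theta; the loop is the optimization that a pure evaluation routine cannot perform.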
For instance, in their paper, they use their algorithm to compute the AUC score for a fixed \theta, while our algorithm finds the \theta that maximizes the AUC score (more precisely, a convex surrogate of it). Similarly, in the context of metric learning, their algorithm could be used to compute the value of the objective for a fixed metric, but not to find the optimal metric, as we do in our experiments.

* Additional experiments

We performed additional experiments on two real datasets for both AUC maximization and metric learning, with results similar to those reported in the main text. We will add these experiments to the supplementary material.

** Reviewer 8 **

The setting of interest in this paper is quite specific: (i) one cannot choose how the data is spread over the network, as it is naturally distributed (for instance, because data is collected locally at each node); (ii) there is no central machine to coordinate the computation or to aggregate the results of the individual nodes; and (iii) more generally, one has no control over the network topology (nodes can only communicate when specific requirements are met, say two sensors are close enough). Such strong constraints are common in many applications (e.g., sensor networks, mobile phones, the Internet of Things, connected cars) and make standard distributed optimization methods unsuitable, as these rely heavily on a central coordinator machine. In contrast, gossip protocols are well suited to this setting: they model both the uncertainty of the communication links and the underlying asynchrony, and they are usually simple enough to run on computing devices with limited CPU and memory. We will improve the introduction to better highlight these crucial characteristics.
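As an illustration of how lightweight such gossip steps can be, and of the single-swap mechanism mentioned in our answer to Reviewer 10, here is a toy sketch of the communication pattern (hypothetical names and updates; not the paper's exact algorithm): at each asynchronous tick, a single random edge wakes up, and its two endpoints swap their auxiliary observations and average their local parameters, with no central coordinator.

```python
import random

def async_gossip_tick(edges, obs, theta):
    """One asynchronous time step: a single random edge (i, j) wakes up
    (only one edge may be active per tick). Its endpoints swap their
    auxiliary observations -- the 'single swap' -- and average their
    local parameters. Toy sketch of the communication pattern only."""
    i, j = random.choice(edges)
    obs[i], obs[j] = obs[j], obs[i]    # data swap along the active edge
    avg = 0.5 * (theta[i] + theta[j])
    theta[i] = theta[j] = avg          # local pairwise averaging
    return i, j

# 4-node ring network
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
obs = ["x0", "x1", "x2", "x3"]         # one observation held per node
theta = [float(k) for k in range(4)]   # local parameter at each node
random.seed(1)
for _ in range(200):
    async_gossip_tick(edges, obs, theta)
```

Repeated ticks circulate the observations around the network while the local parameters converge toward their network-wide average, all with per-step cost and memory that a constrained device can afford.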