We thank all reviewers for their thoughtful comments. Below we respond to specific concerns.
Reviewer3
- Motivation for non-exogenous gossip protocol
The sharing protocol we propose is a deliberate algorithm-design choice; we will make this clearer in lines 72-98.
Gossip algorithms have had great success in other areas of distributed learning and computation (see e.g., [1], [2], [3] and [4]). Furthermore, in our setting we were able to prove excellent regret bounds, while keeping per-round communication costs very low for large networks. Specifically, communication complexity is only log-linear in network size |V|.
Moreover, we show that a delay that is only logarithmic in both time and |V|, together with the naturally rapid diffusion of data under our gossip protocol, gracefully mitigates the bias (i.e., the deviation of the weights ‘w’ from 1/|V|) that arises from not tracking the agents' sharing history.
- 'Natural Algorithm' (Randomised sharing)
Keeping track of observations, and thereby avoiding duplicates, is appealing, but its per-round communication complexity is at least quadratic in |V|, whereas ours is linear. The reason is that agent i, when communicating with agent j, needs to know, for each agent k, which tuples held by agent i are the latest ones originating from agent k. The communication cost of this is of order |V|. Since every agent shares information with somebody in each round, this gives a per-round communication complexity of order |V|^2 across the network.
We improve on this by not tracking the individual observations each agent has gathered, relying instead on diffusion to even out the distribution of data across the network. Each agent shares A^i_t and b^i_t at each time step, which are only of logarithmic order in |V| and time.
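To illustrate the diffusion idea, here is a minimal sketch of one gossip round in which paired agents average their aggregate statistics. The function name, the use of scalar statistics, and the pairwise-averaging rule are our illustrative assumptions, not the paper's exact update (there, A^i_t is a matrix and b^i_t a vector):

```python
import random

def gossip_round(stats, rng):
    """One gossip round: pair agents via a random permutation and let
    each pair average their (A, b) statistics, diffusing information
    across the network without tracking per-observation history.
    Illustrative sketch only."""
    agents = list(stats)
    rng.shuffle(agents)
    # With an odd number of agents, one agent sits this round out.
    for i, j in zip(agents[::2], agents[1::2]):
        avg_A = (stats[i][0] + stats[j][0]) / 2
        avg_b = (stats[i][1] + stats[j][1]) / 2
        stats[i] = stats[j] = (avg_A, avg_b)
    return stats

# After repeated rounds, agent 0's initial observation spreads across
# the network while the total information is conserved.
rng = random.Random(0)
stats = {i: (float(i == 0), float(i == 0)) for i in range(8)}
for _ in range(5):
    gossip_round(stats, rng)
```

Because each pair replaces its two values with their average, the network-wide total is conserved while individual holdings converge towards the mean, which is the sense in which diffusion "evens out" the data.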
From a memory perspective, storing the entire T_i sets, as suggested, would be extremely cumbersome. Our memory cost is only logarithmic in time and network size.
- Round-Robin (R-R) (Deterministic sharing)
R-R protocols are typically designed for static and complete (in the graph-theoretic sense) networks. They do have low communication complexity, the same as our gossip-based protocol.
However, R-R protocols introduce a longer delay than ours: it depends linearly on |V|, rather than logarithmically as ours does. At any time, each agent will be lacking |V|(|V|-1)/2 observations. Analysing the effect of this delay on the cumulative regret bound still requires our Proposition 2, a key contribution of our paper. Using this proposition we can derive a regret bound for R-R in the non-clustering case with the same asymptotic dependence on |V| as our protocol, but with an additive constant that is worse by a multiplicative factor of |V|. For large networks this is a substantial difference.
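To make the scaling gap concrete, a back-of-the-envelope comparison (illustrative only; the constants, function names, and base-2 logarithm are our assumptions, not quantities from the paper):

```python
import math

def rr_missing_observations(V):
    # Under round-robin, each agent lacks |V|(|V|-1)/2 observations
    # at any time.
    return V * (V - 1) // 2

def gossip_delay(V):
    # Our gossip protocol needs a delay only logarithmic in |V|
    # (constants omitted for illustration).
    return math.ceil(math.log2(V))

for V in (100, 10_000, 1_000_000):
    print(f"|V|={V}: R-R backlog ~{rr_missing_observations(V)}, "
          f"gossip delay ~{gossip_delay(V)}")
```

For a million-agent network the R-R backlog is on the order of 10^11 observations per agent, versus a delay of only around 20 rounds under the logarithmic scaling.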
Another important reason to consider gossip-based algorithms over deterministic sharing is that gossip-based algorithms are immediately implementable on non-complete, dynamic networks. It is straightforward to use tools from [5] to extend our arguments to this case. Implementing these algorithms in sparsely connected networks is a key ingredient in making them truly scalable to large networks.
Reviewer5
First, thank you for the minor comments. We respond to your numbered comments below:
1) The bias coming from imperfect information sharing is summarised by the deviation of the weights ‘w’ from their limiting values of 1/|V|. It arises because, rather than information being passed around so that every agent holds at most one copy of each observation at any time, the information is allowed to diffuse across the network. We will explain this further in the main text.
2) Choosing neighbours independently at random, instead of using a random permutation, works just as well in practice. The random permutation is required only for our proofs, and techniques from [5] should remove this requirement.
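The two neighbour-selection schemes differ only in the matching step; a minimal sketch (function names are ours, for illustration):

```python
import random

def pair_by_permutation(n, rng):
    """Random-permutation matching: every agent communicates exactly
    once per round (the scheme our proofs assume)."""
    agents = list(range(n))
    rng.shuffle(agents)
    return list(zip(agents[::2], agents[1::2]))

def pair_independently(n, rng):
    """Independent choice: each agent contacts a uniformly random
    partner, so some agents may be contacted several times or not at
    all in a round (self-pairing is not excluded in this sketch).
    This is the practical variant discussed above."""
    return [(i, rng.randrange(n)) for i in range(n)]
```

The permutation scheme guarantees every agent participates exactly once per round, which is what the proofs exploit; the independent scheme only matches this in expectation.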
3) We will explain the figures more. MovieLens corresponds to the hit scenario (few clusters), Delicious corresponds to niche (many clusters), and LastFM is somewhere in between.
4) The randomisation and delay have a negligible effect; the mild initial variation between DCCB and CLUB reflects slightly different clustering processes.
5) They exchange complete buffers.
6) We will include a more detailed discussion along the lines of the response to Reviewer3.
7) Line 277 is the correct one. We will make the usage consistent.
Reviewer6
- We will add a detailed explanation of the datasets.
- For motivation for clustering the bandits, please see e.g. [6].
- We will increase the font size.
[1] Gossip-based Learning under Drifting Concepts in Fully Distributed Networks
[2] Peer-to-Peer Multi-Class Boosting
[3] Adaptation for the masses: towards decentralized adaptation in large-scale P2P recommenders
[4] A P2P REcommender system based on Gossip Overlays
[5] Gossip-based computation of aggregate information
[6] Online Clustering of Bandits