Paper ID: 612
Title: Distributed Clustering of Linear Bandits in Peer to Peer Networks

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper presents a distributed algorithm for linear bandits and its extension that finds clusters of bandits.

Clarity - Justification:
In general the paper is clear, but there are several points that are not perfectly explained.

Significance - Justification:
I would not call it a "major breakthrough", but it is an interesting result.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Overall, the paper is interesting. It is very technical and I did not have time to check all the proofs in detail. I have a few questions I would be glad to have clarified in the rebuttal:

1. One of the central challenges in the paper is handling the bias that originates from imperfect information sharing. I must admit that I did not understand why imperfect information sharing introduces a bias; it would be good to explain this in the paper. In your DCB setting, the agents are solving the same linear bandit problem, so information from other agents simply adds information about the same problem. I do not see where the bias comes from.

2. Is it crucial to use a random permutation for information sharing? Wouldn't a random choice of partners (not necessarily a permutation) yield a similar result?

3. In the first figure in the experiments, instantaneous information sharing performs worse than no information sharing, while in the last figure the picture is the opposite. Is this because of the cluster structure?

4. It is not very clear to me why the shapes of the DCCB and CLUB curves in the experiments are so similar; I would expect randomization and delay to drive them apart.

5. It is not absolutely clear whether the agents exchange their latest estimates or their complete buffers.
The meaning of averaging buffers in Line 223 is also not absolutely clear.

6. You provide no explanation of your estimate of the communication complexity. It would be helpful to add one, at least in the appendix.

7. Your definition of \tau in Line 277 is confusing (as is the notation) and does not agree with the definition of \tau in Line 677.

Other minor comments:
- Your example of a sensor network that "needs to spread data over a large area" does not match your algorithm: in your algorithm all agents eventually communicate with all other agents, so they have to communicate with their distant peers.
- Line 94: I think the i^* notation would be cleaner than {i,*}.
- Line 118: realize -> implement
- Line 120: delete "been"
- Line 172: delete "improvements in the"
- Line 201: The definitions A^i = A^i_t, and others of the same sort, are confusing.
- Line 207: Consider revising this sentence.
- Footnote 1: give at least one example of how, at least in the appendix.
- Line 248: I did not understand why \sum_{i'} w_{i,t}^{i',t'} = |V|... and, more generally, what these weights are.
- Line 335: "are be"
- Line 338: "this is only an issue ..." - what is an issue?
- Line 528: It should be a weak inequality, and the same in Line 529.
- Line 537: The first is a weak inequality and the second is an equality; the continuation also requires the corresponding correction.
- Line 559: c = c_\lambda^{tresh} - confusing

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors consider a setting where multiple agents are solving a linear bandit problem. On each round t, each agent i is given a set of actions D^i_t and must select some action x^i_t. Selecting x^i_t produces bandit feedback from a linear model. What makes the problem unique is that the agents (1) are provided feedback from the same linear model (i.e.
the unknown parameter \theta of the linear bandit problem is shared), and (2) at each time step, an agent may share information with some other agent chosen according to a random permutation.

The solution is to have the agents execute a distributed version of the confidence ball algorithm. In the non-distributed setting, running the confidence ball algorithm requires maintaining statistics (A, b), allowing one to compute a least-squares (L2-regularized) estimate of \theta. Ordinarily A is a covariance matrix A = sum_t x^i_t (x^i_t)^T, and b = sum_t r_t x^i_t. In the distributed setting, each agent maintains a buffer corresponding to A and b. At each time step, these buffers are shared with a random other agent. Upon receiving a new buffer, it is averaged with the agent's current buffer and augmented with the next observation (x^i_t (x^i_t)^T for the A buffer, r_t x^i_t for the b buffer). The agent also maintains an "active" estimate of both A and b. After a long enough period tau, an element is moved off the buffer and incorporated into the agent's active estimate.

The authors prove that the resulting algorithm loses a factor sqrt{|V|} in its regret compared to solving each problem independently (without sharing information). They then leverage this analysis to solve a distributed version of the clustered linear bandits problem. In the clustered linear bandits problem, agents do not know whether they are solving the same linear bandit problem; agents sharing the same \theta are said to belong to the same cluster. They derive regret bounds and demonstrate performance comparable to the centralized algorithm on a number of tasks.

Clarity - Justification:
The exposition is clear, and the proofs are reasonably clear given the technical detail. I have some concerns, however. First, it is never made clear whether the gossip protocol should be thought of as part of the model (exogenously fixed by nature), or whether it should be treated as a choice of the algorithm.
If it is exogenous, then it needs to be motivated better: what are some motivating examples where agents are collectively solving bandit problems while tied to this gossip protocol? If it is not exogenous, then one could imagine other protocols with the same communication complexity that would simplify the problem. For example, each agent could send its information in a round-robin manner, or to the least-updated agent.

Significance - Justification:
Consider the following natural algorithm. Every time an agent i pulls arm x_t^i, receiving reward r_t^i, it creates a tuple (x_t^i, r_t^i, i, t). Every agent maintains a set T_i of every tuple it knows about. Agent i augments T_i whenever it tries a new action, or whenever it receives T_j from some other agent j (updating T_i = T_i union T_j). At the beginning of each time step, each agent i computes its confidence ball around \theta using all the available information in T_i. The communication complexity of such an algorithm is of the same order as that of the provided algorithm: O(|V| log(|V|^2 t)). Moreover, its analysis seems like a straightforward extension of CB (there is no "doubling" of data points and no bias to track): every tuple, no matter who pulled it, only improves the agent's confidence ball around \theta. Can the authors comment on the feasibility of such an approach?

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I like the setting as a theoretical exercise. I think the proofs are interesting, and the paper gives an interesting analysis of the confidence ball algorithm when the covariances are biased by arbitrary (but controllable) weights. I have two major concerns, mentioned in the rest of the review. First, what motivates tying the algorithm to this particular protocol? Is there an application where some other protocol (with the same communication complexity) would be unreasonable?
Second, there seems to be a more straightforward solution to this problem even when tied to the particular gossip protocol.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors present two distributed algorithms for solving the linear bandit problem in peer-to-peer networks using a specific type of communication known as a gossip protocol. The first algorithm is designed for the situation where all the agents solve the same linear bandit problem. The second algorithm addresses the situation of several clusters with different linear bandit problems.

Clarity - Justification:
The problem is clearly explained; however, it would be better to improve the motivation and explain more clearly the reason for using multi-agent linear bandits in peer-to-peer networks. Furthermore, it is not entirely clear why it is desirable or necessary to use clusters of different bandit problems.

Significance - Justification:
I believe the results are of reasonable quality.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The authors provide extensive proofs of improved regret bounds, which are then supported experimentally. The datasets used for the experiments lack detailed explanation; it is not quite clear how they differ from each other and what their quality is. The paper does give some novel approaches to using the gossip protocol in the distributed setting and improves existing regret bounds. Minor: it would be helpful to increase the font size in the plots.

=====
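For concreteness, the buffer-averaging gossip step summarized in Review #2 (agents paired by a random permutation, (A, b) buffers averaged, observations appended, and buffers flushed into an "active" estimate after a delay tau) can be sketched as follows. This is a minimal illustration under assumed data structures; the function and field names are inventions for this sketch, not the paper's actual DCB implementation or notation.

```python
import numpy as np

def gossip_round(agents, rng):
    """One gossip round: pair agents via a random permutation and average
    their (A, b) buffers. Each agent is a dict with buffered statistics
    ('A_buf', 'b_buf') and active statistics ('A', 'b')."""
    perm = rng.permutation(len(agents))
    for i, j in zip(perm[::2], perm[1::2]):
        # Imperfect sharing: each partner ends up with the *mean* of the
        # two buffers, not the union of the underlying observations.
        A_avg = (agents[i]['A_buf'] + agents[j]['A_buf']) / 2.0
        b_avg = (agents[i]['b_buf'] + agents[j]['b_buf']) / 2.0
        for k in (i, j):
            agents[k]['A_buf'] = A_avg.copy()
            agents[k]['b_buf'] = b_avg.copy()

def observe(agent, x, r):
    """Augment the agent's buffers with its own new observation (x, r)."""
    agent['A_buf'] += np.outer(x, x)
    agent['b_buf'] += r * x

def flush(agent):
    """After a long enough delay tau, move the buffered statistics into
    the active estimate used for the confidence ball."""
    agent['A'] += agent['A_buf']
    agent['b'] += agent['b_buf']
    agent['A_buf'] = np.zeros_like(agent['A_buf'])
    agent['b_buf'] = np.zeros_like(agent['b_buf'])

def theta_hat(agent, lam=1.0):
    """L2-regularized least-squares estimate from the active statistics."""
    d = agent['A'].shape[0]
    return np.linalg.solve(agent['A'] + lam * np.eye(d), agent['b'])
```

In this sketch the averaging weights are why the effective statistics are a *weighted* (and hence potentially biased) combination of observations, which is the quantity Review #1's question 1 and Review #2's detailed comments discuss.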
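The alternative algorithm proposed in Review #2's Significance section (each agent keeps the set T_i of all (x, r, i, t) tuples it knows about, takes unions with partners' sets, and fits the confidence-ball statistics from the union) can be sketched as below. All names and signatures are illustrative assumptions for this sketch, not anything from the paper under review.

```python
import numpy as np

def merge_and_estimate(T_i, T_j, lam=1.0, d=2):
    """Merge a partner's tuple set into the agent's own and compute the
    regularized least-squares estimate from the union. Because every tuple
    carries its originating (agent, t) pair, the set union collapses
    repeated copies of the same observation: no "doubling" of data points
    and no bias to track."""
    merged = T_i | T_j  # T_i := T_i ∪ T_j
    A = lam * np.eye(d)
    b = np.zeros(d)
    for (x, r, _agent, _t) in merged:
        x = np.asarray(x)
        A += np.outer(x, x)
        b += r * x
    return merged, np.linalg.solve(A, b)
```

Communicating whole tuple sets is more verbose per message than exchanging fixed-size (A, b) buffers, which may be one practical reason to prefer the buffer-based protocol; the reviewer's point is that the order of communication complexity is the same.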