Paper ID: 1115
Title: Scalable Discrete Sampling as a Multi-Armed Bandit Problem

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

The paper considers the problem of sampling from a discrete distribution over D values, where the probability p(i) of outcome i is a product of several factors f_n(i), for n = 1, ..., N. The naive algorithm would take O(ND) time. This paper re-casts the problem as a maximization and, by treating each i as an arm, is able to apply best-arm identification algorithms from the MAB setting. The goal is to use these existing MAB algorithms to allow scaling to large N.

The transformation to a maximization problem is done by associating each i with the value log p(i) plus a Gumbel-distributed random variate; the maximizing i is then distributed according to p. Since log p(i) is itself a sum of log f_n(i), with some arithmetic the log f_n(i) (plus the extra Gumbel variate) for n = 1, ..., N can be treated as a sequence of rewards associated with arm i. Unlike in traditional bandit settings, the rewards for arm i are not iid, but rather are sampled without replacement from a population of size N.

The main result of this paper, therefore, is to adapt two families of MAB best-arm algorithms (lil'UCB and Racing, with two variants of the latter) to reward sequences that are drawn from finite populations (per arm) without replacement. The hope is that, when sampling the first few f_n(i) for different i already reveals a large gap between the likelihoods of different i, there is no need to sample the remaining f_n(i) for the obviously bad arms, potentially taking much less than O(ND) time.

Clarity - Justification:

This paper is exceptionally clear and well-motivated, especially in the introduction of the problem, the transformation to best-arm identification, and the adaptation of lil'UCB to the setting. Thereafter, while the paper is still quite strong, I would have appreciated more discussion on certain topics. Most importantly, in section 3.3.1 a Serfling concentration bound is used to find an appropriate action-elimination gap function G for the Racing algorithm. It is then shown that with this choice of G, the algorithm is "correct" in that it needs no more than ND samples. Here, it would be useful to have a discussion of whether/why stronger assumptions are needed to bring the cost down to o(ND) samples (which would indeed be better than the naive sampling algorithm). The next section does make such stronger assumptions and shows the advantage when the gap between arms is large (Prop. 7).

Significance - Justification:

This work brings together several interesting ideas and takes them in a useful direction. Even disregarding the application to discrete sampling, the adaptation of MAB algorithms to finite populations of reward values is intriguing, potentially taking advantage of the good finite-time guarantees that many such algorithms possess. The application of these ideas to the discrete sampling problem is well-motivated, but would benefit from more discussion. Given that the naive sampling method uses O(ND) time, a summary of how different assumptions on the f_n give asymptotically better performance would be good to see, to supplement the empirical results. (This is partly addressed for the Racing-Normal algorithm.)

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

On the whole, I think this work is significant and novel.
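To make the reduction concrete, here is a minimal sketch of the construction as I understand it; the toy setup, the variable names, and the particular way of folding the Gumbel variate into the rewards are mine, not necessarily the paper's exact construction:

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy instance: D outcomes, N factors; log_f[n, i] = log f_n(i).
    D, N = 5, 1000
    log_f = 0.1 * rng.normal(size=(N, D))

    # Gumbel-Max trick: with g_i ~ Gumbel(0, 1), the index
    # argmax_i [ sum_n log f_n(i) + g_i ] is an exact draw from
    # p(i) proportional to prod_n f_n(i). Naive cost: O(ND).
    g = rng.gumbel(size=D)
    sample = int(np.argmax(log_f.sum(axis=0) + g))

    # Bandit view: dividing by N, arm i carries the finite reward
    # population { log f_n(i) + g_i / N : n = 1, ..., N }, whose mean
    # orders the arms exactly as the objective above. Best-arm
    # identification, drawing these rewards without replacement, can
    # settle the argmax after touching far fewer than N*D factors.
    rewards = log_f + g[None, :] / N   # rewards[n, i]: n-th reward of arm i
    assert int(np.argmax(rewards.mean(axis=0))) == sample

The point of the sketch is only that the maximizer of the per-arm population means coincides with the Gumbel-Max sample, which is what licenses the best-arm machinery.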
I found this way of applying MAB algorithms to discrete sampling quite elegant and, on the whole, fairly well-motivated. If there were more concrete asymptotic results for the number of samples used, I think the idea of adapting MAB algorithms to fixed reward populations sampled without replacement would stand strongly even on its own; here, it is somewhat incidental to the goal of discrete sampling. I do not consider this a problem so much as potential future work.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

This paper develops three algorithms for approximate sampling from a discrete distribution. The paper considers the case of a small state space but large statistical dependency (a large number of factors). The algorithms exploit the recently explored Gumbel-Max trick, which frames sampling as a discrete maximization problem. Crucially, the maximization problem can be expressed as a multi-armed bandit problem. These techniques could be used as subroutines in the context of MCMC algorithms (e.g., the conditionals in Gibbs sampling).

Clarity - Justification:

I found most of the paper easy to follow, although I am not an expert on bandit algorithms and may have misunderstood parts of section 3. The related work section could use an extra sentence of intuition regarding the connection to subsampled MH; that connection is crucial to the majority of the paper and is easy to miss. It might even be useful to separate it into its own subsection just to emphasize it. The experiments section could also be made clearer. I suggest clearly displaying which factors the approximate sampling algorithms are applied to, and giving more detail on the MCMC outer loops (augmented MCMC, annealed Gibbs sampling). I don't think all the details are necessary, just enough to grok the way in which the algorithms are being used.

Significance - Justification:

The authors did a decent job of justifying the work; the application to subsampled MH is especially worthwhile. I am not qualified to comment on the significance of the experiments chosen, but I can easily see how this framework is broadly significant.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

If there is time, I might suggest another experiment: Gibbs sampling in a very large, fully connected Ising model. In this case the factors of the conditional would be the contributions from the neighbours of the current variable. Even though it might not be a realistic application, it has the advantage that it is easy to understand quickly; a toy sketch of what I mean follows at the end of these comments.

lines 136, 527, "Gumbel trick": the literature on perturb-and-MAP usually calls this the "Gumbel-Max trick".
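To spell the suggested experiment out, here is how one Gibbs conditional would factorize; the model size, the coupling scale, and all names here are mine, purely for illustration:

    import numpy as np

    rng = np.random.default_rng(1)

    # Fully connected Ising model on M spins; J symmetric, zero diagonal.
    M = 2000
    J = rng.normal(scale=1.0 / np.sqrt(M), size=(M, M))
    J = (J + J.T) / 2
    np.fill_diagonal(J, 0.0)
    x = rng.choice([-1, 1], size=M)

    # Gibbs conditional for spin j:
    #   p(x_j = s) is proportional to prod_{k != j} exp(s * J[j, k] * x[k]),
    # i.e. D = 2 outcomes (s = -1, +1) and N = M - 1 factors, one per
    # neighbour; exactly the product form the paper's samplers target.
    j = 0
    neighbours = np.delete(np.arange(M), j)
    contrib = J[j, neighbours] * x[neighbours]       # one term per neighbour
    log_factors = np.outer(contrib, np.array([-1.0, 1.0]))

    # Exact conditional draw (what the bandit samplers would approximate
    # by subsampling the M - 1 factor rows):
    logits = log_factors.sum(axis=0)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    x[j] = rng.choice([-1, 1], p=p)

Here each row of log_factors plays the role of one reward log f_n(i), so the per-update cost the bandit samplers could save scales with M.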
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

The paper presents an approach for sampling from a discrete-valued distribution in a “large scale” regime, that is, where the calculation of $P(X=i)$ for each $i$ in its state space (of size $D$) involves a product of $N$ likelihood terms, with $N$ large. A standard approach for sampling $X$ would require $ND$ calculations, which can be overly expensive if $N$ is large. The paper applies the “Gumbel trick”, whereby, with the use of $D$ iid samples from the Gumbel distribution, the problem of sampling $X$ is transformed into an optimisation one (equation (2) in the paper). By randomising the selection of the log-likelihood terms at each arm $i$, this optimisation problem is easily shown to correspond to finding the maximiser over $D$ means of finite-size populations (each of size $N$). The paper connects this to the standard “Multi-Armed Bandits” (MAB) problem, which provides an approach for solving the above optimisation problem by sub-sampling the $D$ populations and delivering an answer that is correct with high probability, with the benefit of potentially reducing the number of population samples needed. The paper adjusts the MAB techniques to the finite-population context relevant here. It presents a number of algorithms, for instance a “Racing” algorithm, and a similar one where a normal approximation to the running-mean sampling distribution is used. The paper shows examples on a number of interesting applications, meant to study the accuracy of the method in various contexts.

Clarity - Justification:

The paper is well-written; I find it easy to follow.

Significance - Justification:

From what I understand, the main contribution of the paper is adjusting already available MAB techniques to the finite-population context of the algorithm, together with the recognition in the first place that the given problem can be cast as an MAB one. In general, I find the contribution interesting and I believe the paper makes a useful contribution to a very active area of research. The paper is well written.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

As in my replies above. To record my reading of the finite-population adjustment, a toy racing loop is sketched below.
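This is only my paraphrase of the elimination scheme: one reward per surviving arm per round, drawn without replacement, with a Serfling-style confidence radius whose exact constants may differ from the paper's gap function G. All names and the test instance are mine:

    import numpy as np

    def race_best_arm(rewards, delta=0.05, value_range=1.0, seed=0):
        """Toy racing over D finite reward populations (columns of the
        (N, D) array `rewards`), sampled without replacement."""
        rng = np.random.default_rng(seed)
        N, D = rewards.shape
        order = np.stack([rng.permutation(N) for _ in range(D)], axis=1)
        alive = list(range(D))
        sums = np.zeros(D)
        for n in range(1, N + 1):
            for i in alive:
                sums[i] += rewards[order[n - 1, i], i]
            means = sums[alive] / n
            # Serfling-style radius: the (1 - (n-1)/N) factor shrinks the
            # usual Hoeffding radius as the population is exhausted; after
            # N draws the empirical mean is exact, so the radius is zero.
            rho = 1.0 - (n - 1) / N
            rad = 0.0 if n == N else (
                value_range * np.sqrt(rho * np.log(2.0 / delta) / (2.0 * n)))
            best_lb = means.max() - rad
            alive = [i for i, m in zip(alive, means) if m + rad >= best_lb]
            if len(alive) == 1:
                return alive[0], n      # winner, rounds used (<= N)
        return alive[0], N

    # 10 arms, populations of size 500 in [0, 1], arm 3 clearly best:
    rng = np.random.default_rng(42)
    pops = 0.5 * rng.uniform(size=(500, 10))
    pops[:, 3] += 0.5
    print(race_best_arm(pops))          # typically (3, n) with n << 500

The appeal, as the paper argues, is that when the gap between the best arm and the rest is large, elimination happens at n much smaller than N, which is where the o(ND) behaviour would come from.

=====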