Paper ID: 529
Title: The Knowledge Gradient for Sequential Decision Making with Stochastic Binary Feedbacks

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper studies a kind of Bayesian zeroth-order stochastic optimization problem, in which the learner sequentially queries points x and observes binary responses y, with the objective of identifying a final output point x having large P(y=+1|x) (according to the posterior distribution given the observed responses to queries). The paper specifically studies a technique known as the "knowledge gradient", which essentially chooses queries to maximize the expected increase in the value function, in an MDP interpretation of the problem. The innovation of the present work is extending this approach to a model with binary response variables, rather than real-valued responses (with the natural generalization of maximizing the regression function at the final point x). Many relaxations and approximations are required along the way to make the method computationally tractable. The authors study its behavior and performance on a variety of natural and artificial data sets.

Clarity - Justification:
The paper reads well, with high-level descriptions motivating the key steps alongside the technical derivations.

Significance - Justification:
The authors extend the knowledge-gradient method to a model of binary responses.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The abstract approach seems reasonable to me, and the approximate implementation appears to shine in the empirical comparisons. One question I have: I am not entirely sure I understand the motivation for wanting to find a point x with large P(y=+1|x), rather than merely wanting a point x among the collected data D^{N+1} whose response variable is y=+1. In the medical application from the introduction, wouldn't we only need one of the procedures in the sequence to be successful?
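For concreteness, the knowledge-gradient idea described in the summary above -- pick the query maximizing the expected one-step gain in the best posterior mean -- can be sketched in its simplest binary setting, independent Bernoulli arms with Beta posteriors. This is a reviewer's illustrative simplification, not the correlated linear model the paper actually uses:

```python
def kg_beta(alphas, betas):
    """Knowledge-gradient factor for each arm under independent
    Beta(alpha, beta) posteriors over Bernoulli success probabilities."""
    means = [a / (a + b) for a, b in zip(alphas, betas)]
    best = max(means)                  # value of the current knowledge state
    kg = []
    for i, (a, b) in enumerate(zip(alphas, betas)):
        p = means[i]                   # predictive P(y = 1 | query arm i)
        up = means[:]                  # posterior means if y = 1 is observed
        up[i] = (a + 1) / (a + b + 1)
        down = means[:]                # posterior means if y = 0 is observed
        down[i] = a / (a + b + 1)
        # expected value of the best arm after one observation, minus current
        kg.append(p * max(up) + (1 - p) * max(down) - best)
    return kg
```

With uniform priors the KG factor is identical across arms; as an arm accumulates observations, its KG factor shrinks relative to less-explored arms, which is what drives the policy's exploration.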
Perhaps the authors can clarify the motivation for this.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors propose an extension of the Knowledge Gradient algorithm that allows for binary feedback rather than real-valued observations. They also propose using a Bayesian linear model rather than a GP to reduce computation.

Clarity - Justification:
The paper is reasonably clear.

Significance - Justification:
See below.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The closest competitor to this method is the extension of Expected Improvement (EI) to binary-valued observations in Tesch et al. The authors do show improved performance over EI, and ultimately this would lead me to argue for weak acceptance; however, it is unclear to me where this improved performance comes from. The authors claim that KG is equivalent to EI in the case of deterministic observations, but this setting is obviously not the one used in the Tesch work. This also somewhat ignores the widespread use and extension of EI (i.e., not EGO) to noisy environments in machine learning; see, for example, the tutorial of Brochu et al., or the thesis of Michael Osborne (and earlier work). The authors also claim that bandit problems focus on the setting of cumulative regret, and rightly much of that literature is focused on this area. But the problem the authors are most interested in is equivalent to a simple-regret problem; see, for example, the long line of work by Audibert, Bubeck, Munos, and others. This work has even been applied to Bayesian optimization: Hoffman et al., "On correlation...", examines the simple regret of a BO algorithm.
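For reference, the KG/EI equivalence claim discussed above concerns the standard closed-form EI acquisition under a Gaussian posterior; a minimal sketch of that standard formula follows (a reviewer's illustration of textbook EI for maximization, not the authors' binary-feedback variant):

```python
import math

def expected_improvement(mu, sigma, best):
    """Closed-form EI for a Gaussian posterior N(mu, sigma^2) at a candidate
    point, relative to the incumbent value `best` (maximization)."""
    if sigma <= 0.0:
        return max(mu - best, 0.0)     # degenerate noise-free case
    z = (mu - best) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # normal CDF
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # normal pdf
    return sigma * (z * Phi + phi)
```

In the deterministic setting the incumbent `best` is known exactly, which is where the claimed equivalence to KG would apply; under noisy binary feedback the incumbent itself is uncertain, which is precisely the regime Tesch et al. address.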
More recently, work such as entropy search attacks similar problems, albeit with no theoretical guarantees (see Hennig and Schuler, as well as more recent work by Hernandez-Lobato et al. and the earlier IAGO work of Villemonteix et al.). Interestingly, simple-regret minimization often requires more exploration, which seems somewhat at odds with KG, which, due to its similarity to EI, may be better suited, at least theoretically, to cumulative-regret problems---although this is just conjecture. Additionally, Bayesian linear models have a long history of use in BO, most notably in Snoek et al., but also in the aforementioned work of Hoffman et al. Ultimately the work appears interesting and provides results that improve over EI, but the "why" of this result is not really examined---especially given the similarity to EI. Finally, hyperparameters often play a huge role in Bayesian optimization, and it was unclear how the authors approached this problem. Were they sampled or optimized? And if optimized, what was the effect of this choice, which is often too myopic in early iterations?

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors address a sequential decision-testing problem, where each test is represented by a feature vector, has an unknown binary outcome, and may be time-consuming, expensive, or both. The problem is to sequentially select different tests in order to learn the test with the highest likelihood of a positive outcome. The authors cast the observation model of a test outcome as a linear classifier with a prior on its parameters, model the sequential decisions as a Markov Decision Process, and develop a policy that selects the test maximizing the value of information given the current state.

Clarity - Justification:
The paper is very well written. Most concepts are clearly explained.
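To make the observation model described in the summary above concrete: a linear classifier with a Gaussian prior on its weights can be updated sequentially via a Laplace approximation, which is the standard state-transition ingredient for this kind of MDP formulation. A one-dimensional sketch of that general technique follows (a reviewer's illustration, not the authors' implementation):

```python
import math

def laplace_update(mean, var, x, y, steps=50):
    """One-step Laplace update of a Gaussian posterior N(mean, var) over a
    scalar logistic-regression weight, after observing label y in {-1, +1}
    for feature x. Returns the new (mean, variance)."""
    w = mean
    hess = 1.0 / var
    for _ in range(steps):
        sig = 1.0 / (1.0 + math.exp(-y * x * w))         # P(y | x, w)
        grad = (w - mean) / var - (1.0 - sig) * y * x    # neg. log-post. grad
        hess = 1.0 / var + sig * (1.0 - sig) * x * x     # and its curvature
        w -= grad / hess                                 # Newton step to MAP
    return w, 1.0 / hess              # Gaussian approximation at the MAP
```

Observing y = +1 at x = 1 shifts the posterior mean positive and shrinks the variance; the updated (mean, variance) pair is the next knowledge state of the MDP.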
Significance - Justification:
The paper proposes a novel approach to binary sequential decision making under budget constraints. The authors develop extensions of the knowledge gradient to MDPs with a binary-classification observation model. They also provide convincing experimental results illustrating the advantage of their approach over other policies on several datasets.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The problem of sequential decisions under a budget is an important one, and the authors provide an interesting solution approach. I think it is a very well written paper, of interest to many. The authors may want to address related (although not identical) work on learning MDP policies for budgeted learning tasks:

Gao, Tianshi, and Daphne Koller. "Active classification based on value of classifier." Advances in Neural Information Processing Systems. 2011.
He, He, Jason Eisner, and Hal Daume. "Imitation learning by coaching." Advances in Neural Information Processing Systems. 2012.

=====