We appreciate the positive remarks and thoughtful questions of the reviewers. We begin by noting the major contributions of the paper, and then go into more detail in the individual responses below.

In this paper, we consider a binary sequential decision problem where the goal is to maximize the probability of success. In the context of our motivating applications (medical treatments), the costs of collecting data are high, requiring that we learn from our decisions as quickly as possible. We develop a knowledge-gradient (KG) type policy that guides the experiments by maximizing the expected value of information from measuring each alternative. Because of the sequential nature of the problem, we deploy an online update for the Bayesian linear classification model to make the method computationally tractable.

=== Assigned_Reviewer_1

Thanks for the comments and constructive suggestions. We address each of the questions in detail below.

1. Why KG outperforms EI. As pointed out in the introduction and the experimental section, KG is an exact calculation of the value of information (a.k.a. expected improvement), while EI is an approximation that ignores measurement noise. The accuracy of this calculation becomes especially important with small budgets (as happens when measurements are expensive). By "no measurement noise" we did not mean that EI cannot be used in noisy environments. Rather, when EI decides which alternative to measure, it relies only on the current predictive posterior distribution, ignoring the potential change of the posterior resulting from the next (stochastic) measurement. As shown in Section 5.6 of Powell & Ryzhov (2012), for Gaussian process regression the EI value essentially assumes the variance of the noise to be 0. A similar explanation holds in our setting of binary outputs: in Eq. 12 of Tesch et al., the expectation is taken over the current posterior distribution of \pi(x), whereas KG accounts for the change in the posterior distribution resulting from the stochastic output y by taking an additional level of expectation. As a consequence, at each decision point EI does not consider how the estimate at another point x' would change if we measure x (although the correlation is used in the model updates); yet, especially with parametric models, measuring one x can provide information about others, and KG's calculation captures this desirable effect (a minimal numerical sketch follows after point 2 below). We will make this point clearer in the final version.

2. We are aware of bandit problems with simple regret. Although our focus is simple regret, the point we wanted to make is that there is a simple relationship between the knowledge gradient for simple regret and that for cumulative regret. We agree with the reviewer that simple-regret minimization requires more exploration, and the KG policy reflects this: comparing the KG formula for cumulative regret (the last equation on page 5) with that for simple regret (Eq. 9) shows that KG for simple regret leads to more exploration (see the schematic forms following the numerical sketch below). We will add references to, and a discussion of, bandit problems with simple regret. As highlighted in the paper, we emphasize that UCB policies are best suited to high-volume applications such as optimizing ad clicks, while the knowledge gradient is best suited to situations where measurements are expensive, as happens in health applications.
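To make the distinction in point 1 concrete, here is a minimal numerical sketch. For clarity it uses independent Beta-Bernoulli beliefs rather than the correlated Bayesian linear classifier of the paper, purely to isolate EI's single expectation over the current posterior versus KG's additional expectation over the stochastic label y; the priors and sample counts are hypothetical.

    import numpy as np

    rng = np.random.default_rng(0)
    a = np.array([2.0, 1.0, 1.0])   # Beta "success" counts, one per alternative
    b = np.array([1.0, 1.0, 2.0])   # Beta "failure" counts, one per alternative

    def post_mean(a, b):
        # current estimates pi_n(x) of the success probabilities
        return a / (a + b)

    def ei(x, n_samples=100_000):
        # EI-style value: a single expectation over the *current* posterior of
        # theta_x, ignoring how the posterior will move after the measurement.
        best = post_mean(a, b).max()
        theta = rng.beta(a[x], b[x], size=n_samples)
        return np.maximum(theta - best, 0.0).mean()

    def kg(x):
        # KG value: an additional expectation over the stochastic outcome
        # y ~ Bernoulli(pi_n(x)), with the posterior updated for each y.
        pi_n = post_mean(a, b)
        p1 = pi_n[x]                              # predictive P(y = 1 | x)
        future_best = 0.0
        for y, prob in ((1, p1), (0, 1.0 - p1)):
            a1, b1 = a.copy(), b.copy()
            a1[x] += y                            # conjugate Beta update ...
            b1[x] += 1 - y                        # ... after observing y at x
            future_best += prob * post_mean(a1, b1).max()
        return future_best - pi_n.max()           # E_y[max pi_{n+1}] - max pi_n

    for x in range(3):
        print(f"x={x}: EI={ei(x):.4f}  KG={kg(x):.4f}")

With a correlated parametric model, the updated means pi_{n+1}(x') inside kg() would change for every x', which is exactly the effect EI cannot see.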
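For point 2, the comparison can be summarized schematically, assuming the standard forms from the optimal-learning literature (e.g., Powell & Ryzhov, 2012); Eq. 9 and the last equation on page 5 of the paper give the precise versions for our binary model. For simple regret,

    \nu^{KG,n}(x) = \mathbb{E}\big[\max_{x'} \pi^{n+1}(x') \mid x^n = x\big] - \max_{x'} \pi^{n}(x'),

while for cumulative regret over a horizon of N measurements the online rule selects

    \arg\max_x \big\{ \pi^{n}(x) + (N - n)\,\nu^{KG,n}(x) \big\}.

The additional exploitation term \pi^{n}(x) biases the cumulative-regret policy toward alternatives that already look good, so the pure \nu^{KG,n} criterion used for simple regret explores comparatively more.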
3. Even though the Bayesian linear (regression) model has a long history in BO (e.g., Hoffman et al.), to the best of our knowledge Tesch et al. is the first and only work to consider the stochastic binary problem. With a Bayesian linear classification model, the nonlinear transformation of the sigmoid function introduces additional computational hurdles.

4. In terms of the hyperparameters: since we use a diagonal covariance matrix with a constant value \lambda, due to its equivalence with l2-regularized logistic regression, the prior mean is set to 0 and the regularization coefficient \lambda is trained on sampled datasets via cross-validation (a minimal sketch appears at the end of this rebuttal).

=== Assigned_Reviewer_5

We appreciate the reviewer's supportive comments. In the final version, we will add references to, and a discussion of, budgeted learning by Gao et al. and He et al.

=== Assigned_Reviewer_6

Thanks for the positive comments. In real life, it is rarely possible to have perfectly noise-free observations, and our goal is to optimize under uncertainty. Unlike the common assumption in active learning, if we try the same alternative (medical choice) twice, the outcome y can be different. So in the collected data D^{N+1}, any point x can have both responses y = +1 and y = 0. The best thing to do is to find the x that is most likely to be successful (see the sketch below). We will make this point clearer in the paper.
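As a minimal illustration of both this point and the hyperparameter choice in point 4 above, the sketch below fits an l2-regularized logistic model (scikit-learn's C is 1/\lambda, selected by cross-validation) to hypothetical data in which re-measured alternatives can receive conflicting labels, and then reports the alternative most likely to succeed. The data-generating weights and dimensions are invented for illustration, and the paper's actual model is updated online rather than refit from scratch.

    import numpy as np
    from sklearn.linear_model import LogisticRegressionCV

    # Hypothetical setup: 200 alternatives with 3 features, the first 20 of
    # which are measured twice, so the same x can return both y = 1 and y = 0.
    rng = np.random.default_rng(1)
    X = rng.uniform(-1, 1, size=(200, 3))
    X = np.vstack([X, X[:20]])                    # repeated measurements
    true_w = np.array([1.5, -2.0, 0.5])           # invented ground truth
    p_true = 1.0 / (1.0 + np.exp(-X @ true_w))    # true success probabilities
    y = rng.binomial(1, p_true)                   # stochastic binary outcomes

    # l2-regularized logistic regression: C = 1/lambda chosen by
    # cross-validation, with weights shrunk toward the prior mean 0.
    model = LogisticRegressionCV(Cs=10, cv=5, penalty="l2").fit(X, y)

    # No x is observed noise-free; the best we can do is report the x with
    # the highest estimated success probability pi(x).
    pi_hat = model.predict_proba(X)[:, 1]
    print("selected lambda:", 1.0 / model.C_[0])
    print("most promising alternative:", int(pi_hat.argmax()))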