Paper ID: 198
Title: Online Stochastic Linear Optimization under One-bit Feedback

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): The paper studies the problem of online stochastic linear optimization, where the noisy feedback is binary and assumed to be generated by a logit model. Despite the non-linearity and even non-convexity of the logit function, the authors give an online algorithm for this problem that achieves d\sqrt{T} regret, and can be implemented efficiently if the decision domain is a Euclidean ball (with small diameter).

Clarity - Justification: The paper is well written and easy to follow. The main contributions are clearly stated, and related work is summarized adequately.

Significance - Justification: The problem considered in the paper is nice, and to my knowledge, the proposed solution is novel. However, it seems to me that this is not really the problem one wants to solve when thinking about applications such as online advertisement and recommendation; see the detailed comments below. Also, the solution technique is strongly tailored to the specifics of the logit model, and its wider applicability is doubtful.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): I'm not sure that I see how the considered setting is relevant to applications (e.g., in online advertisement and online recommendation). My concern is that the parameter vector w_star is assumed to be fixed and deterministic throughout, and only the binary feedback is stochastic, whereas in the applications mentioned, the vector w is typically a *random variable* corresponding to a random user interacting with a certain website, etc. The two settings are NOT equivalent, due to the non-linearity of the logit model (in the linear case they are indeed equivalent). In other words, it seems to me that the problem one really wants to solve is more like an online (stochastic) logistic regression problem, where on each round a fresh sample of w is used, instead of the same vector w_star in all rounds.
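To make the non-equivalence concrete, here is a small numeric check (my own construction, not taken from the paper): because the sigmoid does not commute with expectations, the expected feedback under a random w generally differs from the feedback under the mean vector E[w].

    import numpy as np

    rng = np.random.default_rng(0)
    sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

    x = np.array([1.0, 1.0])                          # a fixed action
    mu = np.array([1.0, 1.0])                         # E[w] over random users
    w = mu + rng.normal(0.0, 2.0, size=(200000, 2))   # a fresh user per round

    print(sigma(w @ x).mean())   # E_w[sigma(x.w)] ~ 0.73
    print(sigma(mu @ x))         # sigma(x.E[w])   ~ 0.88 -- not the same value

So a fixed w_star cannot, in general, reproduce the click probabilities of the random-w model, which is why I think the distinction matters for the motivating applications.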
Also, I wonder if the authors can comment on exactly which properties of the logistic model enable their solution. It seems to me that exp-concavity is not really the key property, since in generalized linear models (GLMs) exp-concavity is implied by strong convexity of the link function (which is also the case for the logit model). Can their solution be applied to any strongly convex GLM?

Overall, despite my concerns above, I think that the paper solves a nice mathematical question using a novel and non-trivial algorithm, and contributes (albeit incrementally) to the literature on online optimization.

More comments/questions:

* You say that the problem is a "bandit setting" of an "online linear optimization" problem, but the loss functions are in fact nonlinear: they are expectations of the logit model (Eq. (4)). Also, it is not really a bandit problem, since there is a mismatch between the loss incurred and the feedback observed on each round. Please be clearer about exactly what problem is being solved. Is it minimizing the linear regret defined in Eq. (2) using feedback generated by the logit model in Eq. (3)?

* Please state in Thm 3 the resulting regret bound of Algo 1, in terms of the natural parameters of the problem (T, d, R). The regret bound of the algorithm has an exponential dependence on the diameter R of the decision set, and this dependence is currently swept under the rug (i.e., hidden inside the big-O notation); the precise bound is never stated explicitly.

* Line 82, "The setup of SLB assumes infinite bit precision of the feedback": I'm not sure I agree; in SLB, the feedback signal (= the incurred loss) might as well be a binary/integer random variable, as long as its expectation is "right".

* A small typo in line 304: "motives".

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): This paper proposes an efficient algorithm with an optimal regret bound for online stochastic linear bandits (SLB) under a special one-bit feedback model. The algorithm makes use of a variant of the online Newton step to estimate the unknown parameter and construct a tight confidence region (a generic sketch of such an update is given after the detailed comments below). Although less general than previous work by Filippi et al., 2010, the proposed algorithm is more efficient, can deal with an infinite number of arms, and can benefit from existing techniques for speeding up SLB.

Clarity - Justification: The paper is very well structured and the writing is excellent. I enjoyed reading it.

Significance - Justification: This work nicely addresses the computational issue of a special yet interesting case of SLB, providing a promising algorithm for practice. It also opens up some interesting future directions, such as how to generalize this to the adversarial setting and/or a broader class of observation models.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

1. In Eq. (12), why is there a term max(\lambda, \beta/2)? Up to Eq. (21), there is only a \lambda R^2 term. So where does the \beta/2 come from?

2. Since computational efficiency is one of the main advantages of the proposed algorithm over GLB, I think the experiments should focus on this. For example, in Figure 3, it would be more interesting if the x-axis were the actual running time and the y-axis the distance from the optimum, to see whether OL2M indeed converges faster than GLB.

3. In Figure 1, it would be easier to compare the four cases if they were plotted in the same graph.
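For reference, here is a generic online Newton step on logistic losses, i.e., the kind of update the algorithm builds on. This is my own sketch under simplified notation, not the paper's OL2M update: in particular, the generalized projection of the iterate back onto the decision set (in the norm induced by A) is omitted, and the constants lam and gamma are illustrative.

    import numpy as np

    def sigma(z):
        return 1.0 / (1.0 + np.exp(-z))

    class OnlineNewtonLogistic:
        # Losses f_t(w) = log(1 + exp(-y_t <x_t, w>)) with y_t in {-1, +1}.
        def __init__(self, d, lam=1.0, gamma=0.5):
            self.A = lam * np.eye(d)   # regularized curvature matrix
            self.w = np.zeros(d)
            self.gamma = gamma         # step-size constant

        def update(self, x, y):
            g = -y * x * sigma(-y * np.dot(x, self.w))   # gradient of f_t at w
            self.A += np.outer(g, g)                     # rank-one update
            self.w -= np.linalg.solve(self.A, g) / self.gamma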
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): This paper provides a regret upper bound (Theorem 3) for an algorithm suited to a particular case of the Generalized Linear Bandit model, in which the rewards are binary and follow the logistic model. The main contribution of the paper (Theorem 1) is a new confidence region for the regression parameter w, which takes the form of an ellipsoid around an estimate w_t computed using an online Newton algorithm. This makes it possible to propose an algorithm similar to the optimistic algorithms for Linear Bandit models (Lin-UCB, OFUL), based on this specific confidence ellipsoid, whose analysis then follows the classic analysis of OFUL and uses an additional link between the regret in Linear and Generalized Linear models (Lemma 1). It is highlighted in the paper that the proposed approach is computationally efficient, since it does not require solving a logistic regression problem at each round, but instead builds on online logistic regression.

Clarity - Justification: The paper is overall well written, with a few minor typos or improper formulations. The proof of Theorem 1 would benefit from a bit more intuition in the main paper (see below), and the structure of the related work section should be improved as well.

Significance - Justification: I find the theoretical contribution interesting: to the best of my knowledge, this is the first time that an algorithm building on an online logistic regression method is analyzed under a Generalized Linear Bandit model (unlike previous analyses, which rely on the exact value of the Maximum Likelihood Estimator at each round). Moreover, the confidence region derived in Theorem 1 uses a nice martingale technique. However, I am not fully convinced of the practical impact of the proposed algorithm. The experiments indeed indicate that the algorithm is outperformed by the GLB algorithm of [Filippi et al.]. It is highlighted that the latter algorithm relies on the computation of the MLE at each round, which is costly. However, I bet that in practice one could use online logistic regression in GLB as well, without changing its performance much (this should at least be tried). Moreover, these algorithms should be compared to another algorithm that performs very well in practice: Thompson Sampling with a Gaussian approximation of the posterior, for which a fully online implementation is described in [Chapelle, Li, An empirical evaluation of Thompson Sampling, NIPS 2011].

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

Related works and missing references. The structure of Section 2 is a bit odd; in particular, Section 2.4 relates to the adversarial version of the LB and should therefore come just after Section 2.2 (I'm not sure a full paragraph on adversarial bandits is very relevant for the paper, since it is not explained how the proposed algorithms could be of use in the setting considered). The models discussed in Section 2.5, in particular Dueling Bandits, are also quite different from the setup considered, and not quite relevant. What should be discussed more, on the other hand, is the literature on online logistic regression. In the paper, the authors use an online Newton step. I'm not an expert, but I believe there are other ways to perform online logistic regression, like Stochastic Gradient Descent (see e.g. [Bach, Adaptivity of Averaged Stochastic Gradient Descent to Local Strong Convexity for Logistic Regression, JMLR 2014]; a sketch is given below). As already pointed out above, it should be mentioned that all these techniques could be used as heuristics to reduce the computational cost of the GLB algorithm (even if this is not justified theoretically, unlike in the proposed algorithm). Moreover, I believe that in the paper by Chapelle and Li on the use of Thompson Sampling in the model considered, they proposed a computationally efficient algorithm.
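For concreteness, here is a minimal sketch of the SGD alternative just mentioned, with a constant step size and Polyak averaging in the spirit of Bach (2014); the code and parameters are mine and purely illustrative.

    import numpy as np

    def sigma(z):
        return 1.0 / (1.0 + np.exp(-z))

    def averaged_sgd_logistic(stream, d, eta=0.1):
        # stream yields (x, y) pairs with y in {-1, +1}
        w, w_bar = np.zeros(d), np.zeros(d)
        for t, (x, y) in enumerate(stream, start=1):
            g = -y * x * sigma(-y * np.dot(x, w))   # logistic-loss gradient
            w = w - eta * g                         # constant-step SGD update
            w_bar += (w - w_bar) / t                # running average of iterates
        return w_bar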
The theorems. In the proof of Theorem 1, rather than \nabla \bar{f_t}, I would write \bar{\nabla f_t} everywhere (it is not clear whether the two can really be interchanged here, and what is really needed in the proof is the fact that the expectation of \nabla f_t is \bar{\nabla f_t}). It would be good to include the properties of the martingale \sum_i b_i in the main text, so that the reader understands why the Bernstein inequality can be used (bounds on the increments + a bound on the variance). In the paragraph right after Theorem 3, it should rather be said that, thanks to Lemma 1, the OL2M algorithm matches the rate of the GLB algorithm, which is the benchmark in the setting considered.

Minor comment: it is odd that all the indices in (e.g.) (10), (11) and in the theorem statements are t+1 rather than t.

The experimental section is not very satisfying. First, there is a long section on parameter tuning, which is not very enlightening: the optimal parameters will be different on each problem, the theoretically allowed parameters (the full bound in (12)) are never tried, and using the instantaneous regret to optimize the parameters is odd when the goal is to minimize cumulative regret. In the experiments against GLB, it should be mentioned which parameters are used in the latter (theoretical parameters, or manually tuned?). As mentioned above, Thompson Sampling as well as heuristic GLB with online logistic regression should be tried as well. Finally, unlike the OFUL algorithm, this algorithm does not appear to handle non-static context sets (D_t in place of D), which are relevant in applications to online advertisement: can this be improved?
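To support the last two points: here is a minimal sketch of the Thompson Sampling baseline with a diagonal Gaussian approximation of the posterior, in the spirit of the online implementation of Chapelle and Li (the code is mine and illustrative, not theirs). Note that the arm-selection step takes whatever arm set is presented at the current round, so a time-varying D_t is handled naturally.

    import numpy as np

    def sigma(z):
        return 1.0 / (1.0 + np.exp(-z))

    class GaussianTSLogistic:
        def __init__(self, d, lam=1.0):
            self.m = np.zeros(d)         # posterior mean
            self.q = lam * np.ones(d)    # diagonal posterior precision

        def select(self, arms):          # arms = the current round's D_t
            w = self.m + np.random.randn(self.m.size) / np.sqrt(self.q)
            return max(arms, key=lambda x: float(np.dot(x, w)))

        def update(self, x, y):          # binary reward y in {0, 1}
            p = sigma(np.dot(x, self.m))
            self.m += (y - p) * x / self.q      # preconditioned gradient step
            self.q += x * x * p * (1.0 - p)     # Laplace-style curvature update

=====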