Paper ID: 1055
Title: Generalization and Exploration via Randomized Value Functions

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper studies the problem of exploration-exploitation in RL when linear function approximation is used. The core idea is that a random value function can be sampled from a posterior and used to derive the greedy policy for the next episode. This Thompson-sampling-like strategy is shown to achieve optimal regret in the tabular case, and its empirical performance is illustrated on a number of problems.

Clarity - Justification:
The paper is clearly written.

Significance - Justification:
How to perform exploration-exploitation with function approximation in RL is a very important open question in the field. The paper is far from being conclusive on the topic, but it provides some potentially interesting elements.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

1- References. While the paper provides a quite complete overview of the literature, there are at least two papers that should be compared to:

Aditya Gopalan and Shie Mannor, "Thompson Sampling for Learning Parameterized Markov Decision Processes", Proc. Conference on Learning Theory (COLT), Paris (France), 2015.

K. Lakshmanan, R. Ortner, and D. Ryabko, "Improved Regret Bounds for Undiscounted Continuous Reinforcement Learning", JMLR Workshop and Conference Proceedings Volume 37: Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.

While the first has no function approximation, it works in the case of parameterized MDPs, which does not differ too much from the realizable case of function approximation. Furthermore, it explicitly works on a Thompson strategy which closely resembles the one proposed in this paper and in PSRL. The second paper focuses directly on the problem of approximating MDPs, although no efficient algorithm is proposed.

2- Average reward. The algorithm and the theoretical results are presented in the case of episodic problems. Is it possible to generalize them to the case of average reward in an infinite-horizon problem (as in most regret-minimization algorithms)?

3- Comparison to UCRL2. While the comparison of the bound in Thm. 1 is correct, it is really difficult to understand where the potential improvement in the bound is coming from. a) The analysis is done in a very specific setting; maybe even UCRL2 would enjoy a better bound in this setting. b) You consider finite-horizon episodic problems. Are we sure that the (more general) bound of UCRL2 could not be improved in this case?

4- Lambda. The choice of lambda is critical to guarantee the optimism of the random sampling, but the actual value that guarantees it depends on alpha^R and alpha^P, which are unknown to the algorithm. How much would the performance be affected by improperly tuned parameters?

5- Optimism. Def. 1 requires optimism only in expectation. It would be interesting to also have a high-probability bound. Furthermore, while the proof seems correct, it appears very different in nature from the Thompson-sampling proof techniques for linear bandits:

S. Agrawal, N. Goyal, "Thompson Sampling for Contextual Bandits with Linear Payoffs", Proceedings of the 30th International Conference on Machine Learning (ICML), 2013.

whose analysis should be reusable here.

6- Gaussian approximation. In the construction of b_i at step h, the rewards coming from h+1 are used. Yet, the Gaussian assumption is "propagated" through episodes. Even if the rewards r_H are indeed distributed as a Gaussian, this would not necessarily translate into Gaussian distributions when propagated back through steps. Does this pose a problem in the definition of the algorithm? (See the sketch after this review for the sampling step in question.)

7- At the beginning of Section 6 it is said that "RLSVI [..] acts as an effective Gaussian approximation to PSRL"; please clarify.

8- The experiments are very diverse, providing good coverage from more illustrative to more significant (in terms of application), and they give good support to the interest of the proposed approach.

9- It seems that the conjecture is still presented in the realizable case, but this seems to impose a very strong assumption on the structure of the MDP. I.e., what is the structure of an MDP which guarantees Q^* to be linear in a given feature space?

Recommendation
=================
The paper does not provide a full solution to exploration-exploitation in linear RL, but it provides a simple and efficient algorithm which is principled (the theory provides a minimum level of support for its soundness) and which is well validated from an empirical point of view. So I tend towards acceptance.
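To make the sampling step discussed in points 4 and 6 concrete, the following is a minimal sketch, not taken from the paper, of one backward step of the randomized value iteration the reviews describe: a Bayesian linear regression on Q-targets built from step h+1, followed by a single draw from the Gaussian posterior. All names are illustrative; sigma is the assumed observation-noise standard deviation and lam the assumed prior precision.

    # Minimal sketch (assumed parameterization, not the authors' code) of one
    # backward step of randomized least-squares value iteration.
    import numpy as np

    def sample_q_weights(Phi_h, targets, sigma=1.0, lam=1.0, rng=None):
        """Gaussian posterior over Q-weights at step h, then one random draw.

        Phi_h:   (n, d) features phi(s, a) for transitions observed at step h
        targets: (n,) values r + max_a' theta_{h+1}^T phi(s', a'), built from the
                 sampled weights of step h+1 (the 'b_i' construction point 6 asks about)
        """
        rng = np.random.default_rng() if rng is None else rng
        d = Phi_h.shape[1]
        precision = Phi_h.T @ Phi_h / sigma**2 + lam * np.eye(d)  # posterior precision
        cov = np.linalg.inv(precision)                            # posterior covariance
        mean = cov @ (Phi_h.T @ targets) / sigma**2               # posterior mean
        return rng.multivariate_normal(mean, cov)                 # random value function

Holding the sampled weights fixed for a whole episode and acting greedily with respect to them is what gives the Thompson-sampling flavour; the dependence on sigma and lam is exactly the sensitivity raised in point 4.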
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper extends Thompson sampling from a model-based concept to a linear function approximation context. The benefit of doing so is that it eliminates the need to plan in the learned model, making learning "direct".

Clarity - Justification:
Mostly, the paper was clear. Some suggestions follow:

"tabula rasa": I can't tell if you literally mean "tabula rasa" (blank slate), meaning learning with no prior information, or "tabular", meaning learning with no function approximation.

It is impressive that the regret is lower *and* the planning complexity is lower than previous approaches. Is Step 4 of Algorithm 1 also efficient?

It might be worth pointing out that model-free algorithms also shave a factor of S off the bounds in the PAC-learning setting (see the delayed Q-learning paper). In some ways, the present work moves that insight to the regret setting.

"repoduce" -> "reproduce". "the motivating for" -> "the motivation for". "a conjecture of result what may" -> "a conjecture of a result that may".

"it may be possible to establish polynomial regret bounds": Up to this point, the paper made it sound like you have such bounds. Maybe it's the misuse of the term "tabula rasa", but I think it's important to make clear that you have an algorithm that is provably low regret in the tabular setting and that can be applied to linear function approximation, but that no guarantees have yet been proven in the latter case.

"coherent" -> "realizable"? "sensitivies" -> "sensitivities". "acheive" -> "achieve".

"for phi > 0 this will typically not be the case": This example strikes me as less than realistic. Why not decrease the number of bases to be fewer than the number of states?

"RLSVI is somewhat robust model" -> "RLSVI is somewhat robust to model"? "featurs" -> "features".

"their approach": Not clear who "they" is. Earlier, you said there are "several papers".

"linear Bertsekas features": Not defined. "even even" -> "even".

Significance - Justification:
In a sense, the results are still a bit preliminary: the formal results are shown in a setting that is well explored. The function approximation results are not yet proven to hold, although the experimental results are promising.
Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Overall, the approach is interesting and the problem important. More demonstrations of *how* the system explores effectively would be valuable.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This article deals with approximate value iteration in a finite-horizon setting (new samples being collected at each iteration). To deal with the exploration/exploitation dilemma, it is proposed to estimate the (state-action) value functions using Bayesian linear regression and to sample Q-functions from the posterior to collect new samples in a greedy fashion (that is, exploration is done through randomization of the value functions). A bound on the regret is provided for the tabular setting, and the advantages of the proposed approach are demonstrated on several problems.

Clarity - Justification:
Overall, the paper is very clear and well written.

Significance - Justification:
The core idea of this paper is simple, elegant and new, as far as I can tell. It is supported by both a theoretical analysis and empirical results. I think that this can be the basis of a new range of research in reinforcement learning.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
This is a very good paper; below are some minor comments and suggestions.

It would be clearer to develop the linear Bayesian regression setting a little bit (that sigma is the standard deviation of the noise in the observation model, that lambda is related to the prior), possibly in the appendix. Also, what about using the posterior of one iteration as the prior for the next one (possibly mixed with a non-informative prior)? (See the sketch at the end of this review.)

I have not checked the proof of Thm. 1 in detail, but it would be nice to provide a more detailed proof of this result (in the appendix, for example).

Regarding the Tetris experiment: scores in Tetris tend to have a large variance, so it would be better to average the results over more independent trials and to show the related standard deviations. Also, it seems to be the experiment where the improvement is the least significant (compared to the didactic problem and to the recommendation engine). One can wonder whether this is due to the considered benchmark, or whether the problem lies in the infinite-horizon setting (where exploring through randomization of the value function would be less efficient). Moreover, it has been suggested that using the finite-horizon setting for such a problem (with infinite horizon) might be interesting ("On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Discounted Markov Decision Processes" by Scherrer and Lesner, NIPS 2012). How would the proposed approach perform in this case?

The connections of the proposed approach to Bayesian RL are discussed, but there might also be some connections to policy search (e.g., "Parameter-Exploring Policy Gradients" by Sehnke et al.).

=====
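As a companion to Review #3's "posterior of one iteration as the prior for the next" suggestion, here is a minimal sketch of the standard sequential Bayesian linear regression update; it assumes a known noise standard deviation sigma and a Gaussian belief described by a mean vector and a precision matrix. The names are illustrative and not taken from the paper.

    # Sketch of the reviewer's suggestion under standard conjugate assumptions,
    # not the authors' algorithm.
    import numpy as np

    def bayes_update(mean, precision, Phi, y, sigma=1.0):
        """Fold a new batch (Phi, y) into the current Gaussian belief over weights."""
        new_precision = precision + Phi.T @ Phi / sigma**2
        new_mean = np.linalg.solve(new_precision,
                                   precision @ mean + Phi.T @ y / sigma**2)
        return new_mean, new_precision

    # Chain updates starting from the prior N(0, (1/lam) I):
    #   mean, prec = np.zeros(d), lam * np.eye(d)
    #   mean, prec = bayes_update(mean, prec, Phi_batch, y_batch, sigma)

With sigma held fixed, chaining this update over batches is equivalent to a single regression on the pooled data, which is why carrying the posterior forward is a natural variant to consider.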