Thank you to the reviewers for your comments; we will do our best to address the main concerns.

## Tabular versus Tabula Rasa Learning

In retrospect, we have not done an adequate job of defining these terms and have not been careful in distinguishing their usage. By tabula rasa learning, we mean learning without significant prior knowledge. Generalization, for example, leverages prior knowledge to impute values for unobserved state-action pairs from other data, and as such, learning with generalization is not tabula rasa. Tabular refers to how knowledge is encoded. One can, for example, generalize while still employing a tabular encoding. We will clarify this distinction.

## Regret Bounds for Tabula Rasa Learning

The main innovation in our analysis relative to that of UCRL2 allows us to reduce the scaling of regret with the number of states from S to sqrt(S). UCRL2 does not enjoy the same bound. Intuitively, this is because, for each state-action pair, it requires concentration of an S-dimensional transition probability vector rather than a single state-action value. We will try to better communicate this insight in our revision.

Delayed Q-learning provides a similar improvement with respect to dependence on S, but at the cost of much worse dependence on other parameters. In particular, while the delayed Q-learning sample complexity bound is O(SAH^5/epsilon^4), our analysis would lead to an O(SAH^2/epsilon^2) sample complexity bound for RLSVI. To the best of our knowledge, the analysis of Agrawal and Goyal (2013) does not lead to regret bounds of the type we establish and is also missing key ingredients of our analysis.

Our analysis establishes new state-of-the-art bounds for tabula rasa learning. However, the main significance of this result is that, unlike UCRL2 or delayed Q-learning, RLSVI naturally combines generalization and exploration. We do not have an analysis with generalization, but we do present several promising computational results where common dithering strategies are provably inefficient.

## Infinite-Horizon Formulation

In an extended write-up, we have generalized the algorithm to an infinite-horizon formulation, but have not succeeded in generalizing the tabula rasa analysis to that context. It is worth noting that the extension of the simpler PSRL regret analysis to infinite-horizon problems remains open. (A recent publication offers such an extension via "Lazy-PSRL"; we have informed its authors that their analysis is incorrect, and they agree.) We believe that the benefits of efficient exploration can be just as large in the infinite-horizon setting, but that this particular benchmark shows less marked improvement due to idiosyncratic structure.

## Conjectured Bound with Generalization

We agree that our conjecture poses a strong assumption. Our point is not that establishing this conjecture should be the ultimate goal of this research agenda. But proving this conjecture remains open, and doing so would represent an important step toward the more general results one would hope for.

## “Gaussian Propagation”

We think we understand reviewer 5’s interpretation that RLSVI relies on Gaussianity being propagated backward over time periods, but the algorithm does not actually rely on that. We will try to offer more intuition on this in our revision. In a sense, the algorithm generates a noisy sample of the right-hand side of Bellman’s equation, much as PSRL would, except that it replaces the Dirichlet noise associated with sampling a single transition probability vector with Gaussian noise. A key insight that enables our analysis is our optimistic dominance result, which establishes that the Gaussian sample offers at least the same degree of optimism as the Dirichlet sample. On a related note, we will try to provide more explanation of the relation of our algorithm to Bayesian regression and to coherent Bayesian learning, and of how the algorithm deviates from coherent Bayesian learning.
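To convey this intuition concretely, here is a minimal sketch (Python, tabular case, a single state-action pair; the variable names and the particular noise scaling are illustrative, not the exact constants from the paper) contrasting the two kinds of noisy Bellman samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup for one state-action pair in the tabular case:
# counts[s'] holds observed transitions to next state s',
# v_next[s'] the current value estimate at s'.
counts = np.array([3.0, 1.0, 6.0])
v_next = np.array([0.0, 1.0, 0.5])
reward = 0.2  # observed mean reward

# PSRL-style sample: draw a full transition probability vector from a
# Dirichlet posterior and back it up through Bellman's equation.
p_sample = rng.dirichlet(counts)
q_psrl = reward + p_sample @ v_next

# RLSVI-style sample: use the empirical backup, perturbed by Gaussian
# noise whose scale shrinks as the visit count n grows.
n = counts.sum()
p_hat = counts / n
sigma = 1.0  # tunable noise parameter
q_rlsvi = reward + p_hat @ v_next + rng.normal(0.0, sigma / np.sqrt(n))

print(q_psrl, q_rlsvi)
```

The optimistic dominance result mentioned above compares precisely these two kinds of samples.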
## Computational Efficiency of RLSVI

The most computationally intensive tasks, calculating the mean vector and covariance matrices in Step 3 and sampling from the associated Gaussian distribution in Step 4, are reasonably efficient, with complexity growing more slowly than the cube of the number of basis functions; see the sketch at the end of this response.

## Robustness to Choices of lambda and sigma

We agree that the choice of lambda is critical for deriving the regret bound in the tabula rasa case. However, we believe that in practice RLSVI is relatively robust to the choices of lambda and sigma. For instance, for the chain example considered in Section 6.1, the performance of RLSVI remains remarkably stable over a wide range of settings (see Appendix C.2).

## Variance of Tetris Scores

We will run a greater number of simulations and provide confidence bars.

## References

Thank you for the additional citations. We will add them to the paper where appropriate.
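Returning to the computational point above, here is a minimal sketch of the kind of Bayesian-linear-regression computation involved, under generic assumptions (the function and variable names are illustrative; the precise Steps 3 and 4 are specified in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_value_weights(Phi, y, sigma=1.0, lam=1.0):
    """One Gaussian draw of regression weights from the posterior of a
    regularized least-squares fit (illustrative of Steps 3 and 4)."""
    K = Phi.shape[1]
    # Posterior covariance and mean; the inverse and Cholesky factorization
    # are O(K^3) worst case, and maintaining the Gram matrix incrementally
    # across episodes (rank-one updates) brings the per-episode cost below that.
    cov = np.linalg.inv(Phi.T @ Phi / sigma**2 + lam * np.eye(K))
    mean = cov @ (Phi.T @ y) / sigma**2
    return mean + np.linalg.cholesky(cov) @ rng.standard_normal(K)

# Hypothetical usage: 50 observed (feature, target) pairs, 10 basis functions.
Phi = rng.standard_normal((50, 10))
y = rng.standard_normal(50)
w_sample = sample_value_weights(Phi, y)
```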