Paper ID: 750 Title: Bounded Off-Policy Evaluation with Missing Data for Course Recommendation and Curriculum Design Review #1 ===== Summary of the paper (Summarize the main claims/contributions of the paper.): This paper focuses on the problem of personalized course recommendation using only limited and historical data, including off-policy evaluation despite missing data. The authors strongly motivate the subtleties of the problem of course recommendation and explore several relevant technical challenges, including proving bounds and developing off-policy evaluation estimation with missing data. Clarity - Justification: Well written and very thorough; the authors cover a large amount of material coherently. Significance - Justification: The authors establish an interesting setting and application to motivate nontrivial technical challenges; these techniques should be useful in other off-policy evaluation settings and in settings with historical/observational data. Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): I found this paper well motivated practically, with a natural application of pre-existing methods combined with compelling theoretical challenges. This paper is outside my area of expertise, so I am limited on how well I can comment on the methods; I did find the figures somewhat difficult to interpret. A few questions: > For the off-policy evaluation with missing data, is there a stronger connection to more general recommendation systems where we can augment different, pre-existing models? > For the tradeoff between curriculum design (ensuring students' education is optimized in a small constant number of classes) and student performance, are these results equivalent to inferring different *levels* of classes (e.g., based on ability), perhaps simultaneously to learning student levels? Or is there a richer interpretation available here about students on different tracks (either by topic or by teaching/course style)? > Finally, do you think this method could be applied to curriculum design to identify gaps in the existing curriculum? Some typos, including a number of questions as sentences, and particularly: Pg 4: "How can the dataset ... two policies." -> policies? Pg 5: "That is how does..." missing comma and question mark, or rephrasing Pg 5: "We provide guideline" -> guidelines ===== Review #2 ===== Summary of the paper (Summarize the main claims/contributions of the paper.): The authors propose a contextual bandit model for recommending personalized (sequences of) courses to students. Off-policy (i.e., learning from log data) is performed using an unbiased estimator with finite sample convergence. The authors also analyze the impact of augmenting the log data with expert data (e.g., in regions where the log data is too sparse or non-existent). Finally the authors propose an algorithm for performing course recommendations, the algorithm optimizes for policies that trade-off quality and variance, and an empirical evaluation is performed. Clarity - Justification: Clarity. I found the paper to be very clearly written. The technical parts are also well explained and relatively easy to follow. I believe the results would also be reasonably easy to reproduce given the details given in the paper. Significance - Justification: Significance. This paper introduces a principled model for a novel application. My understanding is that the general off-policy evaluation technique proposed is fairly standard. However I believe that the data augmentation technique is novel and seems quite general. The proposed algorithm is interesting. The empirical evaluation nicely explores the problem and provides insights into the solutions. Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): Other comments: - Some of the work cited in the related work Section about recommending relevant courses based sounds personalized yet you imply it isn't. Could you clarify why? - Are there any conditions (e.g., on the logging policy) that need to be met in order for the regression estimator you use to be unbiased and minimax optimal? - Your proposed algorithm (Algorithm 1) does not seem allow for courses to have prerequisites. If that's correct is there an inherent difficulty in taking them into account beyond the difficult of the resulting IP? - Figure 1. It would be good to add a legend (grey: variance. black: bias) - In practice how would one determine the impact of adding missing data? ===== Review #3 ===== Summary of the paper (Summarize the main claims/contributions of the paper.): This is a very application focused paper that frames the problem of selecting a course schedule or designing a curriculum as a contextual multiarmed bandit problem. Context is a measure of a student's academic aptitude (GPA, SAT scores, etc), actions are the selection of a set of courses to take, and the reward is the average grade that the student receives in the courses. Thus the goal is to select courses that maximize the student's grade. A data set of the context, actions, and rewards for previous students is assumed to be provided. The paper uses a regression estimator to estimate the value of a given policy using the provided data set. Since the data set is likely generated by a different policy than the one being evaluated, there is an issue of missing data in that the policy being evaluated may use actions not present in the data set. They discuss the effect of including this missing data by estimating the reward of unobserved context-action pairs using expert domain knowledge. To deal with the effects of this missing data, the authors introduce a regularized search for the optimal policy that penalizes policies according to how their action distribution, as well as the context distribution being considered, differ from that observed in the data set. This approach is justified through analysis that explores the upper bound of the results from the regression estimator. The penalty term is derived from this upper bound such that minimizing it reduces the possibility of overestimating the true value of the evaluated policy. The authors also extend this optimization method to the problem of designing curriculum for a student to ensure that he/she graduates as soon as possible while attaining the required knowledge. Clarity - Justification: This paper would benefit from a thorough editing to improve readability and fix a few typos/errors. There are a couple of sentences that do not make sense as written, and others that end abruptly. There are also several sentences that are questions, yet are not punctuated as such. A notable typo that should be pointed out is in step 2 of Algorithm 1, where what I believe is actually the optimal value is referred to as the optimal policy. Significance - Justification: Experiments are performed with a real data set from an accredited university. These results suggest that the approach is sound, and highlight how it provides schedules that are personalized and take into account a student's background and previous performance. Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): The manuscript provides a fairly straightforward application with an MDP formulation. The methods appear to be sound and it improves upon existing techniques to address the specific issues present in the application. It may benefit from discussing how the techniques can transfer to other related applications. There is no discussion pertaining to the the computational complexity of the provided algorithm, which requires solving many binary integer nonlinear programs, aside from mentioning that efficient algorithms exist for solving the programs. ===== Review #4 ===== Summary of the paper (Summarize the main claims/contributions of the paper.): The paper aims to tackle evaluation of policies for selecting courses for students treating it as a contextual bandit problem except with the constraint that the only policy feedback is that from logged data (i.e. previous course performance for students). Details below: Given a students past performance in courses of different subjects (maths, chemistry, etc.), (one of) the goal(s) of this paper is to device a policy for selecting courses. Furthermore, in evaluating a policy that is different from the logged data selection policy, the authors want to do better than a simple regression estimator (that simply computes expected reward by a weighted sum of past rewards) by taking into account the variance of policy value estimator. The authors provide a covering number based bound for the evaluating a policy. Then a solution to find an optimal policy is presented which augments the baseline empirical regression estimator with a variance minimization term. Finally the authors discuss the application of their method to curriculum design. Clarity - Justification: The paper is reasonably well written in most parts and the authors have tried to explain the results well. However the entire section about Mission data seems unclear to me. It is not clear to me what is the role of missing data and what are the likely natural sources of missing data. The authors comment on line 481 doesn't quite ameliorate this issue. Also there are some errors and erroneous claims in the paper: 1) Line 196: it should be x_t and not r. 2) The lines 409-413 do not seem to be correct as (2) may under as well as over estimate the value of v\pi. Significance - Justification: Theorem 1 is the main theorem of the paper and it seems like a simple advancement over previously known results. The curriculum recommendation methodology also seems too difficult to be considered solid (see below): Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): The idea of minimizing the variance of a policy valuation along with maximizing expected reward seems useful. I also, for the most part, like the fact that the authors have tried to explain their results (except for the missing data part). However the application of this idea in Algorithm 1 seems quite impractical seems because enumerating over all possible feasible course schedules might be intractable (I'm also thinking about other sequential applications of contextual bandits). I think the paper can benefit from a greater elaboration of section 5.3 which compares optimized policy with the solely data-driven policy. Also, perhaps a less ad hoc process for selection lambda than what is described in section 5.2 can also be useful. How sensitive the outcome is to the value of lambda (looking at Figure 2, one could also pick lambda = 0.5). =====