We thank the reviewers for their feedback. We will revise the paper to remove any typos and errors and will add the clarifications below.$
Reviewer 1:
Note that recommender systems for personalized education have unique characteristics and requirements. For instance, our recommender system considers the student’s historical performance (GPA and SAT), the courses taken so far, and the student’s current CGPA when issuing recommendations. The CGPA depends on the courses taken, their difficulty, the concepts taught and the style in which they were taught, as well as the student’s preferences, abilities, and knowledge, all of which evolve endogenously as the student takes classes. Thus, the recommender system needs to account for the students’ evolving characteristics when issuing recommendations.
Our system does not just learn the students’ and courses’ characteristics and then match them; a richer structure is exploited: our system considers the teaching style, course characteristics, as well as the students’ evolving characteristics as they take classes.
Our system may indeed be able to identify curriculum gaps by detecting that certain students have difficulties and get low GPAs in certain classes. Such findings can be brought to the attention of administrators, who can add new classes or remedial materials to reinforce student learning.
Reviewer 3:
We will revise Sec.3.3 and Sec.5.1 to address the case when data is missing. Missing data arises because the logged dataset does not contain GPAs for every possible combination of a student's historical performance and class schedule. Fig.1 shows that the bias of the off-policy evaluation increases as the probability of observing the context-action pair (a,x) increases. This is expected from Theorem 2, since the predicted reward is not equal to the actual reward. The empirical variance of the off-policy estimate likewise increases with the probability of observing (a,x). This agrees with Theorem 3: if the probability of observing the missing pair (a,x) increases, then the variance of the estimated off-policy value increases as well. Fig.1 also illustrates that for p(a,x) < 10^{-3} the empirical variance of the off-policy value is small even though V[r|a,x] is large. This is consistent with Theorem 3, which states that if V[r|a,x] is large but the policy \pi(a|x) places small weight on (a,x), then the variance of the off-policy estimate is small. In addition, expert knowledge may be used to reduce the impact of missing data.
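The effect of missing (a,x) pairs on the bias can be illustrated with a minimal sketch (toy numbers and names, not the paper's code): a direct-method value estimate uses a reward model that mispredicts the never-logged pair, so the bias grows with the weight the target policy places on that pair, as in Theorem 2.

```python
# Illustrative sketch: bias of an off-policy value estimate when the
# logged data never contains the pair (x=1, a=1). All values are toy.

# True mean rewards r(x, a) for 2 contexts x and 2 actions a.
true_r = {(0, 0): 0.9, (0, 1): 0.4, (1, 0): 0.2, (1, 1): 0.7}

# Reward model fit on logs that never observed (1, 1): its prediction
# for that missing pair is wrong.
model_r = dict(true_r)
model_r[(1, 1)] = 0.3  # mispredicted, since (1, 1) was never logged

def true_value(pi):
    """Exact value of policy pi[x][a] under uniform contexts."""
    return 0.5 * sum(pi[x][a] * true_r[(x, a)] for x in (0, 1) for a in (0, 1))

def model_value(pi):
    """Direct-method estimate using the (partly wrong) reward model."""
    return 0.5 * sum(pi[x][a] * model_r[(x, a)] for x in (0, 1) for a in (0, 1))

# As the target policy puts more weight p on the missing pair (1, 1),
# the bias |model_value - true_value| grows linearly in p.
for p in (0.0, 0.5, 1.0):
    pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 1 - p, 1: p}}
    print(p, round(abs(model_value(pi) - true_value(pi)), 3))
```

The bias here is exactly 0.5 * p * |0.3 - 0.7|, i.e., the policy's weight on the missing pair times the reward-model error.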
Complexity
The run-time complexity of the algorithm is not a major barrier to adoption, since the algorithm runs only when a recommendation is needed. The most expensive operation in Algorithm 1 is computing the feasible solutions of a multidimensional Knapsack problem for each of the T terms. An exhaustive search for this problem has time complexity O(A^{T+1}(K+1)), with A the number of course schedules and K the number of course constraints. Moreover, if we only approximate the number of feasible solutions to within a factor of 1\pm\epsilon, (Dyer et al. 1993) provide an algorithm with time complexity T2^{O((K+1)\sqrt{A}(\log A)^{5/2})}\epsilon^{-2}, which is subexponential in A.
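The expensive step can be sketched for a single term (toy course names and budgets, purely illustrative): enumerate every subset of candidate courses and keep those satisfying all K resource constraints, which is what makes the exhaustive search exponential.

```python
# Illustrative sketch: exhaustive enumeration of feasible single-term
# schedules under K = 2 resource constraints (toy data, not the paper's).
from itertools import product

# Each candidate course consumes K = 2 resources
# (e.g. credit hours, weekly workload hours).
costs = {"a1": (3, 5), "a2": (4, 6), "a3": (3, 4)}
budget = (7, 11)  # per-term capacity for each of the K constraints

def feasible_schedules(actions, budget):
    """Enumerate all subsets of actions satisfying every constraint;
    exhaustive search over 2^A subsets, K checks each."""
    out = []
    for mask in product((0, 1), repeat=len(actions)):
        chosen = [a for a, m in zip(actions, mask) if m]
        totals = [sum(costs[a][k] for a in chosen) for k in range(len(budget))]
        if all(t <= b for t, b in zip(totals, budget)):
            out.append(chosen)
    return out

print(len(feasible_schedules(list(costs), budget)))
```

Repeating this per term over a T-term horizon yields the exponential cost quoted above, which motivates the approximate-counting alternative of (Dyer et al. 1993).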
Lines 409-413: We will amend the paper to state “…the solution to \pi\in\operatorname{arg\ max}\{v_\Phi^\pi: \pi\in\Pi\} will significantly overestimate v_\Phi^\pi because \hat{v}_\Phi^\pi is upper-bounded by the empirical variance in (6)”.
Sec.5.3 of the revised paper will include a clear comparison of the optimized policy and data-driven policy.
\lambda can be selected by evaluating \min\{\lambda: \hat{V}[u_D^{\pi(\lambda)}]- \hat{V}[u_D^{\hat{\pi}_D}] \leq d\}, where d is a user-selected bound on the difference between the empirical variance at \lambda and at the logging policy. Fig.2 provides an estimate of the sensitivity of the off-policy value to \lambda.
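This selection rule can be sketched as follows (a hypothetical helper; `var_hat` stands in for \hat{V}[u_D^{\pi(\lambda)}] and the toy variance curve is illustrative only):

```python
# Illustrative sketch of the lambda-selection rule: pick the smallest
# lambda whose empirical variance exceeds that of the logging policy
# by at most the user-selected bound d.
def select_lambda(lambdas, var_hat, var_logging, d):
    """Return min{lambda : var_hat(lambda) - var_logging <= d},
    or None if no candidate satisfies the bound."""
    feasible = [lam for lam in sorted(lambdas) if var_hat(lam) - var_logging <= d]
    return feasible[0] if feasible else None

# Toy variance curve: variance grows as lambda shrinks (illustrative).
var_hat = lambda lam: 1.0 / (lam + 0.1)
print(select_lambda([0.1, 0.5, 1.0, 2.0], var_hat, var_logging=0.4, d=1.0))
```

In practice `var_hat` would be the empirical variance of the off-policy estimate evaluated on the logged data at each candidate \lambda.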
Reviewer 4:
We aimed to highlight that our recommender system uses the student's initial historical performance (GPA and SAT) in addition to the student's evolving characteristics.
Indeed, conditions for (2) exist to ensure that it is unbiased and minimax optimal (see Li et al., 2015).
If each course requires at most one prerequisite, then (8) can be formulated as a precedence-constrained Knapsack problem (Wilbaut et al. 2008). Additionally, Algorithm 1 can be redesigned to allow for courses with general prerequisites. If A_j denotes the set of prerequisite courses required for a_j, then the following constraint can be added to (8): b_c(a_j;t) \leq floor(\frac{1}{|A_j|}\sum_{a\in A_j}\sum_{\tau<t}b_c(a;\tau)), i.e., a_j can be scheduled at term t only if every course in A_j was taken in an earlier term.
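The prerequisite requirement can be sketched as a feasibility check (hypothetical names; b_c(a;\tau) is the indicator that course a is taken at term \tau):

```python
# Illustrative sketch: course a_j is allowed at term t only if every
# course in its prerequisite set A_j appears in some earlier term,
# i.e. floor of the mean cumulative prerequisite indicator equals 1.
import math

def prereq_ok(schedule, t, a_j, A_j):
    """schedule[tau] lists the courses taken at term tau.
    Returns True iff all prerequisites in A_j precede term t."""
    if not A_j:
        return True  # no prerequisites to satisfy
    taken_before = {a for tau in range(t) for a in schedule[tau]}
    indicator = [1 if a in taken_before else 0 for a in A_j]
    return math.floor(sum(indicator) / len(A_j)) == 1

schedule = [["calc1"], ["calc2", "lin_alg"], []]
print(prereq_ok(schedule, 2, "multivar", {"calc1", "calc2"}))  # True
print(prereq_ok(schedule, 1, "multivar", {"calc1", "calc2"}))  # False
```

The floor of the average indicator is 1 exactly when all prerequisites have been taken, and 0 otherwise, which is what makes the constraint expressible within the Knapsack formulation.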
Refer to response to Reviewer 3 regarding missing data.
Reviewer 5:
Refer to response to Reviewer 1 regarding the uniqueness of the proposed recommender system and to Reviewer 3 regarding the computational complexity.
We will amend Step 2 of Algorithm 1 to read “…compute the optimal policy by solving…”.