Paper ID: 351
Title: A Neural Autoregressive Approach to Collaborative Filtering

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes an extension of the NADE model (Neural Autoregressive Distribution Estimator, 2011) to the problem of collaborative filtering. The model closely follows the principle of NADE: each item is treated as a variable in the neural network, and the goal is to predict the rating of a new item given the rating history of a user. The authors present different variants of their collaborative filtering model. Training is performed using a ranking loss. Experiments are performed on three datasets, including the large Netflix dataset.

Clarity - Justification:
The writing is correct, but some passages could be improved. See the detailed comments below.

Significance - Justification:
The paper introduces some new ideas for extending the NADE model and proposes an extensive comparison with baselines.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper introduces new ideas for extending the NADE model. The main conceptual contribution is the adaptation of the NADE idea to collaborative filtering (CF), where variables are categorical instead of binary as in NADE. Another contribution is the adaptation of the model to the very large dimensions corresponding to the number of items or the number of users.

Another strong point is the experimental evaluation, where different variants of the model are compared with baselines on different collections. The comparison is, however, only performed for the MSE loss, while recommendation is increasingly evaluated using a ranking criterion such as AUC. The best model for MSE often does not correspond to the best one for AUC; including AUC comparisons, if possible, would be nice.

A weak point of the model is its complexity. The size of the variable space is extremely large, and different adaptations are necessary to use the model on large datasets. There should be a comparison of the complexity of the different models used in the experimental section, and some figures on the training/test time of the proposed method.

In the description of the model (Sec. 3.1), presenting the model as one NN per user with tied parameters introduces confusion. A better way might be to formulate it as one global model for all users, with a different training objective for each user.

In Section 5, the description of the deep extension is not clear. The reason for considering different orderings should be better explained. You should also define what an update is.

In the experiments, why do you set unknown ratings to 3 rather than simply ignoring them?

In Sec. 6.2, the I-CF-NADE model should be described more precisely; it is not easy to make the transition from the model already described.

Pros: extension of the NADE model to categorical data and adaptation to large variable spaces; experiments on large datasets.
Cons: complexity of the proposed model; no complexity comparison and no execution-time figures.
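To make concrete the categorical conditionals that Review #1 contrasts with NADE's binary ones, here is a generic sketch of the factorization (the notation is illustrative, not necessarily the paper's exact equations):

    % Chain rule over the D items a user has rated, in some ordering o (cf. Eq. (1)):
    p(\mathbf{r}) = \prod_{i=1}^{D} p\big(r_{o_i} \mid \mathbf{r}_{o_{<i}}\big)
    % Each conditional is a K-way softmax over rating values, not a Bernoulli as in NADE:
    p\big(r_{o_i} = k \mid \mathbf{r}_{o_{<i}}\big)
        = \frac{\exp s^{k}_{o_i}(\mathbf{r}_{o_{<i}})}
               {\sum_{k'=1}^{K} \exp s^{k'}_{o_i}(\mathbf{r}_{o_{<i}})}

where s^k is a score produced by the hidden layer from the ratings observed so far.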
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents a neural autoregressive approach to modeling the distribution of a user's ratings in recommender systems. The paper also presents several techniques to further improve the prediction performance. In particular, the paper applies matrix factorization to alleviate the issue of the large weight matrix. Since the output of the auto-regressor is a softmax that treats each rating as an independent class, the paper introduces another loss to force the predicted probabilities of the ratings to have a certain ranking structure, which is more realistic for recommender systems.

Clarity - Justification:
The technical part is easy to follow, as most methods are well-known tricks. However, the presentation and organization leave something to be desired. For example, the use of the "regular cost" is somewhat self-contradictory: Figure 1 shows that there is no need to use the regular cost. Also, some notations are not explicitly defined. For example, I guess W_{:,m} in Eq. (2) should be the m^{th} column of matrix W, but it needs an explicit definition in the text; otherwise, it might confuse the readers.

Significance - Justification:
The technical contribution is incremental; most techniques are borrowed from existing work. For example, the matrix factorization and the rank loss for the so-called ordinal cost are both well known. That being said, it is nice to incorporate them into a single model, so the significance of this paper depends largely on the empirical results. From my perspective, the results on the three tested datasets are good, but not impressive.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
In addition to the comments above, I have the following detailed comments:

- The paper introduces the parameter-sharing method in Section 3.3 to alleviate the problem that weight matrices associated with rare ratings are rarely optimized. In the current method, low-rating weights are always used when the target is a high rating. Would the low-rating weights be "unlearned" by high-rating targets? Also, would adaptive gradient descent (AdaGrad) alleviate this problem as well, by giving rarely updated weights larger learning rates? Lastly, does the current weight-sharing method induce the opposite imbalance, where high-rating weights are optimized relatively rarely, since low-rating weights always get the chance to be optimized?

- One possible explanation of why the "ordinal cost" introduced in Eq. (18) is much more important than the "regular cost" is that the regular cost is actually already incorporated in Eq. (18). What the regular cost in Eq. (3) does is try to maximize the score of the correct rating and minimize the scores of the other ratings; Eq. (18) behaves the same way, and in addition enforces that the ratings follow a ranking structure.

- Since the empirical results are not impressive, it would be great if more motivation could be provided. For example, what additional advantages, other than a good test RMSE, does the proposed method offer?
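Review #2's observation that the ordinal cost of Eq. (18) subsumes the regular cost of Eq. (3) can be illustrated with a small numerical sketch. The two functions below are a paraphrase of the two costs, not code from the paper: rating indices are 0-based, and the ordinal cost is written as two Plackett-Luce-style chains in the spirit of Eq. (18), not its exact formula.

    import numpy as np

    def regular_cost(scores, k):
        # Negative log-likelihood of the true rating index k under a plain
        # K-way softmax: every rating is an independent class (cf. Eq. (3)).
        return -(scores[k] - np.logaddexp.reduce(scores))

    def ordinal_cost(scores, k):
        # Sketch in the spirit of Eq. (18): the true rating k should win the
        # cumulative comparisons against the ratings below and above it, so
        # scores are pushed to decrease monotonically on both sides of k.
        cost = 0.0
        for j in range(k, -1, -1):          # chain going down from k
            cost -= scores[j] - np.logaddexp.reduce(scores[:j + 1])
        for j in range(k, len(scores)):     # chain going up from k
            cost -= scores[j] - np.logaddexp.reduce(scores[j:])
        return cost

    scores = np.array([0.1, 0.5, 2.0, 1.2, 0.3])   # scores for ratings 1..5
    print(regular_cost(scores, k=2), ordinal_cost(scores, k=2))

Note that the true-rating score s_k appears in both chains, so minimizing the ordinal cost also pushes the correct rating's softmax score up, which is exactly what the regular cost does; the extra terms enforce the ranking structure.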
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes a neural autoregressive approach to collaborative filtering. The authors propose a model that tries to predict a user's rating vector based on previous ratings by the same user. They propose many enhancements over the first version of the model: a rating-invariant parameterization, sharing parameters between ratings, a likelihood loss, and, finally, making the model deep by adding another layer. They also propose low-rank approximations for the parameters. The performance of the model is comparable with state-of-the-art models such as LLORMA.

Clarity - Justification:
The paper is well written.

Significance - Justification:
The paper proposes a quite complicated model for collaborative filtering and achieves good results on known benchmark datasets.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The proposed model is quite complicated, has a lot of parameters, and builds upon work on NADE (Neural Autoregressive Density Estimator) and RBM-CF. The authors propose a model for collaborative filtering that models a user's (or item's) rating vector using the chain rule, as in Equation (1). They propose many enhancements over the first version of the model:
* a rating-invariant parameterization
* sharing parameters between ratings
* a likelihood loss
* adding more layers
* low-rank approximations for the parameters

The final model shows impressive empirical results on benchmark datasets. It looks like the autoregressive part and the ordinal cost function (likelihood loss) are the most important ingredients; adding the rating-invariant parameterization, sharing parameters between ratings, and adding more layers further improved the performance.

The paper starts by motivating the solution as a user-based collaborative filtering model, but in the experiments the authors switch to item-based collaborative filtering; the latter has more parameters, since there are more users than items in the datasets considered.

The authors do not say much about the software used or the training times for the Netflix dataset; it would be good to mention these in the paper. It would also be good to see the number of parameters learned by each of the models in Table 1.

=====
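Both Review #2 (the matrix factorization of the large weight matrix) and Review #3 (the low-rank approximations for the parameters) refer to the same device for keeping the parameter count manageable when the variable space is huge. A minimal sketch of that idea follows, with hypothetical dimensions and a tanh hidden layer; it is not the paper's exact parameterization.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical sizes: M items, H hidden units, K rating levels, rank F.
    M, H, K, F = 1000, 100, 5, 30

    # A full rating-specific weight tensor W (K x M x H) has K*M*H = 500,000
    # parameters; factoring each W^k = A^k @ B leaves K*M*F + F*H = 153,000.
    A = rng.normal(scale=0.01, size=(K, M, F))   # rating/item-specific factors
    B = rng.normal(scale=0.01, size=(F, H))      # projection shared by all ratings

    def hidden(observed):
        # Hidden activation from the (item, rating) pairs seen so far, with
        # the row W^k_{m,:} = A[k, m] @ B materialized lazily.
        pre = np.zeros(H)
        for m, k in observed:
            pre += A[k, m] @ B
        return np.tanh(pre)

    h = hidden([(3, 4), (42, 2)])   # e.g. item 3 rated 5 stars, item 42 rated 3
    print(h.shape)                  # (100,)

This is also where the complexity comparison Review #1 asks for would naturally live: the rank F controls the trade-off between parameter count and model capacity.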