We thank the reviewers for their comments. All questions and concerns have been addressed, and the paper has been modified accordingly.

To Assigned_Reviewer_1:

Q1. "Including AUC comparisons if possible would be nice"
We use RMSE to compare fairly with the baselines. The AUC on MovieLens 1M is 0.923, where 4-star and 5-star ratings are treated as relevant (a sketch of this computation is given below). AUCs on the other datasets will be added when ready.
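For concreteness, a minimal sketch of the AUC computation above, assuming a global AUC over the held-out ratings; the function name and the use of scikit-learn are our illustration, not part of the paper:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def rating_auc(true_ratings, predicted_scores):
    """Treat 4- and 5-star ratings as relevant and compute the AUC
    of the predicted scores against this binarized ground truth."""
    relevant = (np.asarray(true_ratings) >= 4).astype(int)
    return roc_auc_score(relevant, predicted_scores)

# Example: rating_auc([5, 3, 4, 1], [4.6, 3.1, 3.9, 2.2]) == 1.0
```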
Q2. "A weak point of the model is its complexity. There should be a comparison of the complexity and some figures on the training/test time of the proposed model"
We will add the suggested comparisons and figures. In fact, the complexity of CF-NADE is in the same range as that of the widely used RBM-CF, so it is not problematic in practice, especially with the method described in Sec. 3.4.

Q3. "The reason for considering different orderings should be better explained"
Directly extending NADE to a deep model would be computationally impractical (see (Uria et al., 2014) for details). Uria et al. showed that considering all possible orderings yields an efficient way to train a deep NADE, and we train deep CF-NADE in a similar manner.

Q4. "You should also define what an update is"
An "update" is one parameter update computed from a mini-batch of training samples.

Q5. "Why do you set up unknown rating to 3 and do not simply ignore them?"
A rating of 3 denotes a neutral preference on a 5-star scale, which is a reasonable assumption for new items and is also the convention used by the baselines. Simply ignoring new items is another option and would not change the results much.

Q6. "In sec. 6.2, I-CF-NADE model should be described more precisely"
More details will be added.

To Assigned_Reviewer_2:

Q7. "The proposed model is quite complicated" and "It would be good to see number of parameters learned by the models"
Please see our response to Q2.

Q8. "The paper starts by motivating the solution for user-based CF model, but the experiments switch to item-based CF"
Item-based CF-NADE is only evaluated on MovieLens 1M, which is a small dataset; user-based CF-NADE is evaluated in all experiments.

Q9. "Authors do not say much about software used, training times for Netflix dataset"
The code is written in Theano and will be released on GitHub. See our response to Q2 for the other concerns.

To Assigned_Reviewer_3:

Q10. "The use of "regular cost" is somewhat self-contradicting. Figure 1 shows that there is no need to use the regular cost"
The regular cost is the natural way to extend the original NADE to CF tasks, which is why we describe it. There is, in fact, no theoretical guarantee that the ordinal cost always outperforms the regular cost, since too much regularization might hurt performance. As good results are already achieved with the ordinal cost alone in Fig. 1, we did not extensively tune \lambda in Eq. 22.

Q11. "I guess W_{:,m} in Eq. 2 should be the m^{th} column of matrix W. But it needs explicit definition in the text"
Yes. We have added an explicit definition to the text.

Q12. "The technical contribution is incremental. Most techniques are borrowed from existing work. For example, the matrix factorization and the rank loss for the so-called ordinal loss are both well-known"
There might be a misunderstanding. We respectfully emphasize that CF-NADE itself is the main technical contribution of the paper: it is quite different from existing CF methods, and it differs from other NADE variants in that it models variable-length, categorical data. The techniques the reviewer mentions are not the core of the paper but extensions of the basic CF-NADE, which we incorporate for comprehensiveness. Moreover, the rank loss of CF-NADE is defined on ratings, not on items as in previous rank-loss-based models.

Q13. "From my perspective, the results on the three datasets are good, but not impressive"
The datasets used in the paper are among the most important public datasets for CF tasks, and CF-NADE beats state-of-the-art models such as AutoRec and LLORMA on all of them. Specifically, on the Netflix dataset, AutoRec outperforms LLORMA by 1.32% ((0.834-0.823)/0.834), while CF-NADE with 1 and 2 hidden layers outperforms AutoRec by 1.34% and 2.43%, respectively (the arithmetic is spelled out at the end of this response). Hence, the margin achieved by CF-NADE is at least comparable to those between other state-of-the-art methods.

Q14. "Would the low-rating weights be “unlearned” by high-rating targets?"
Sharing parameters is a form of regularization, which could harm performance if it were too strong, as the reviewer is concerned. However, our experiments show that it works well in practice.

Q15. "Would adagrad also alleviate this problem?"
Yes, and we use Adam instead.

Q16. "Does the current weight sharing method induce another un-biased issue where high-rating weights are relatively rare to be optimized?"
With parameter sharing, the weights for a high rating k are in fact all the parameters W^j and V^j with j \leq k. Hence, the weights involved in high ratings are optimized sufficiently often (see the sketch below).
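To make the sharing in Q16 concrete, here is a minimal numpy sketch of the cumulative parameterization; the array names and sizes are our illustration, and we refer to the paper for the exact formulation:

```python
import numpy as np

K, H, M = 5, 50, 100          # rating levels, hidden units, items (illustrative sizes)
W = np.random.randn(K, H, M)  # per-rating-level parameters W^1, ..., W^K

# The effective weight for rating level k is the sum of W^j over j <= k,
# so a training rating k produces gradients for every W^j with j <= k:
# the parameters reached by high ratings are shared with, and hence
# updated by, all lower ratings as well.
W_eff = np.cumsum(W, axis=0)  # W_eff[k-1] = W^1 + ... + W^k
```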
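For reference, the worked arithmetic behind Q13, using the Netflix RMSEs 0.834 (LLORMA) and 0.823 (AutoRec) quoted above; the CF-NADE RMSEs of 0.812 and 0.803 are the values implied by the stated margins:

\frac{0.834 - 0.823}{0.834} \approx 1.32\% (AutoRec vs. LLORMA),
\frac{0.823 - 0.812}{0.823} \approx 1.34\% (1-layer CF-NADE vs. AutoRec),
\frac{0.823 - 0.803}{0.823} \approx 2.43\% (2-layer CF-NADE vs. AutoRec).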