We thank the reviewers for their helpful comments.

The manuscript considers the problem of unsupervised ensemble learning, which is of growing interest. To date, the vast majority of work on this problem has considered variants of the conditional independence assumption of Dawid and Skene, under which the classifiers' predictions are independent given the true label. Under that regime, we proved that the solution provided by a single RBM converges to the true solution. For the more general case, where this assumption may not hold, we gave intuitive arguments for stacking RBMs, proposed a practical and efficient way to determine the network's architecture, and demonstrated state-of-the-art results.

1. We agree with reviewer #1 that proving the decoupling of the features along the layers would be a desirable result. We note, however, that proving such a result is, in a sense, one of the hardest and most interesting open problems in deep learning. Despite intensive efforts by many researchers, it remains unsolved: the current theoretical understanding of the great empirical success of deep learning is very limited and consists of many partial results. We believe that the theoretical results in our manuscript, despite their limitations, nonetheless contribute to this understanding.

2. We disagree with reviewer #2 that RBMs are applied in our manuscript to clustering. In contrast to the clustering problem, which is in general ill-posed, unsupervised ensemble learning is a well-known and well-defined problem, and it is clearly formulated in the manuscript.

3. Determining the network architecture is an important problem in neural nets, and to date there is no structured way to do so. In our manuscript we propose a heuristic to tackle this problem. We remark that we are not aware of any other practical work proposing such a constructive algorithm; hence, we consider the proposed algorithm one of the most important contributions of this paper.

4. The second comment made by reviewer #2 is not clear to us and does not seem to apply in the context of our paper. We are in a fully unsupervised scenario, and yet, by construction, our stacked RBM network will not assign all labels to a single class.

5. We agree with reviewer #4 about the need for a statistical test to confirm the superiority of RBMs on the Magic data. To this end, we used a paired t-test to compare the DNN result to each of the other methods; the null hypothesis of this test is that the mean performance of the two compared methods is equal. In all four tests the null hypothesis was rejected with p-value < 1e-13. In the revised version of the paper we will apply McNemar's test, or another suitable test, to confirm the statistical significance of the results; a sketch of such a comparison appears below.

6. The hyper-parameters were indeed tuned on held-out data, using the random search method of Bergstra and Bengio, and by monitoring the free energy and reconstruction error rather than the accuracy (i.e., without knowledge of the true labels); see the second sketch below. We will provide the hyper-parameters of the experiments in an appendix.
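
For concreteness, here is a minimal sketch of the significance tests discussed in point 5, using SciPy's paired t-test and statsmodels' McNemar test. The per-split accuracies and the contingency counts below are hypothetical placeholders, not our actual experimental values.

```python
# Sketch of the significance tests from point 5 (hypothetical numbers).
import numpy as np
from scipy.stats import ttest_rel
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical per-split accuracies of the stacked-RBM DNN and one baseline
# method on the Magic data (e.g., over repeated random splits).
acc_dnn      = np.array([0.86, 0.87, 0.85, 0.88, 0.86])
acc_baseline = np.array([0.81, 0.82, 0.80, 0.83, 0.81])

# Paired t-test: the null hypothesis is that the mean accuracies are equal.
t_stat, p_value = ttest_rel(acc_dnn, acc_baseline)
print(f"paired t-test: t={t_stat:.3f}, p={p_value:.2e}")

# McNemar's test uses per-sample (dis)agreement counts (hypothetical here);
# rows: DNN correct / incorrect, columns: baseline correct / incorrect.
table = [[900, 120],   # both correct | only the DNN correct
         [30,  50]]    # only the baseline correct | both wrong
result = mcnemar(table, exact=False, correction=True)
print(f"McNemar: statistic={result.statistic:.3f}, p={result.pvalue:.2e}")
```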
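Similarly, a minimal sketch of the label-free tuning procedure from point 6: random search in the spirit of Bergstra and Bengio, scoring each candidate configuration by its reconstruction error on held-out data rather than by accuracy. The sampling ranges are illustrative, and `train_rbm` and `reconstruction_error` are hypothetical stand-ins for our actual training code.

```python
# Sketch of label-free hyper-parameter tuning from point 6 (random search
# in the spirit of Bergstra & Bengio, 2012). The ranges are illustrative,
# not the values used in the paper.
import numpy as np

rng = np.random.default_rng(0)

def sample_config():
    """Draw one candidate configuration from illustrative ranges."""
    return {
        "learning_rate": 10 ** rng.uniform(-4, -1),  # log-uniform
        "n_hidden": int(rng.integers(8, 256)),
        "batch_size": int(rng.choice([16, 32, 64, 128])),
        "n_epochs": int(rng.integers(10, 100)),
    }

def tune(X_train, X_held_out, n_trials=50):
    """Pick the config whose RBM best reconstructs held-out data.

    No labels are used anywhere; only the unsupervised reconstruction
    error is monitored (free energy could be tracked analogously).
    """
    best_config, best_err = None, np.inf
    for _ in range(n_trials):
        config = sample_config()
        rbm = train_rbm(X_train, **config)            # hypothetical helper
        err = reconstruction_error(rbm, X_held_out)   # hypothetical helper
        if err < best_err:
            best_config, best_err = config, err
    return best_config, best_err
```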