We thank the reviewers for their insightful comments and suggestions.

R1: We agree that there is a rich literature on KL-divergence based clustering, and we would like to expand on the comment about the novelty of the combination. In our experience, combining a DNN with a clustering objective is non-trivial. We experimented with various clustering objectives, including K-means based, NLL based, and margin methods, and found the proposed approach more robust and better performing than the alternatives; some objectives, such as K-means, do not work at all. We believe that discovering that DEC's loss works well with DNNs is an important novelty and contribution. Given DEC's strong empirical performance, we believe this is a promising direction, and future research along it will benefit from our experience with good and bad ways of combining a DNN with a clustering objective.

We chose to report only accuracy to keep the presentation uncluttered and to save space. In our experiments, however, NMI follows the same trend as ACC. We can include NMI in the camera-ready version if space permits (see rebuttal Table 1 below).

Table 1: NMI on each dataset.
Method | MNIST | STL   | REUTERS-10K | REUTERS
LDMGI  | 0.824 | 0.264 | 0.052       | N/A
SEC    | 0.805 | 0.201 | 0.318       | N/A
DEC    | 0.835 | 0.310 | 0.513       | 0.584

R2: We indeed found auto-encoder pre-training necessary for the best performance and robustness in our experiments. As mentioned to R1, we experimented with various clustering objectives on top of a DNN; DEC outperforms all alternatives we tried, including NLL and K-means. In fact, most alternatives performed poorly or were too unstable to work at all.

We fixed the number of clusters for ease of evaluation and presentation, but agree that this is an artificial scenario. Performance versus cluster number is an interesting study that we will add to future revisions of the paper; Tables 2 and 3 in this rebuttal give a preview.

Table 2: NMI versus number of clusters.
Number of clusters | 3     | 5     | 8     | 9     | 10    | 15    | 20
NMI                | 0.460 | 0.645 | 0.784 | 0.815 | 0.810 | 0.800 | 0.767

We agree with R1 that understanding the impact of HOG features versus raw pixels is an important problem to explore. We used HOG because STL images are too large for fully connected auto-encoders and would instead require convolutional auto-encoders (CAEs). Unfortunately, there is relatively little literature on CAEs and no standard network structure or hyper-parameters to follow. We plan to explore DEC with CAEs on large image datasets such as ImageNet in future research.
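For concreteness, the sketch below illustrates the general form of the KL-divergence clustering objective referred to in our responses to R1 and R2: soft assignments between embedded points and cluster centroids are sharpened into a target distribution, and the embedding is trained to minimize the KL divergence between the two. This is a minimal NumPy sketch rather than our exact implementation; the function names and the choice of a Student's t kernel are illustrative.

```python
import numpy as np

def soft_assignments(z, mu, alpha=1.0):
    """Soft assignment q of embedded points z (n x d) to centroids
    mu (k x d) under a Student's t kernel with alpha degrees of freedom."""
    dist2 = ((z[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (n, k)
    q = (1.0 + dist2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened target p: square q and renormalize by the per-cluster
    soft frequency, so that confident assignments are emphasized."""
    weight = q ** 2 / q.sum(axis=0)
    return weight / weight.sum(axis=1, keepdims=True)

def kl_loss(p, q, eps=1e-12):
    """KL(P || Q), minimized with respect to the embedding and the
    centroids by back-propagation."""
    return (p * np.log((p + eps) / (q + eps))).sum()
```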
R3: On novelty: as mentioned to R1 and R2, discovering the right clustering objective to combine with a DNN is non-trivial. We experimented with a number of alternatives, and most of them perform poorly. We believe this is an important novelty and contribution.

On R3's main concern: the definition of a "good" clustering depends heavily on the specific application, so the clustering problem is flexible or, as R3 puts it, "ill-defined". It is nevertheless an important problem in practice. Researchers who study it typically simplify it with assumptions such as topology preservation; unfortunately, such assumptions and heuristics do not necessarily extend to real applications. For example, in raw data space a small brown dog on grass is closer to a small pile of dirt on grass than to a big dog, or to a small dog in a different part of the image. For most downstream applications, objects are more interesting than the background, so a dog should be close to other dogs regardless of its size and location. A "good" cluster will therefore consist of disjoint parts of the raw data space and "break" the topology. We argue that a data-driven definition of "good" clustering is more practical than hard-coded heuristics.

On assessing overfitting: it is straightforward to assess overfitting with DEC through cross-validation, just as for K-means. We define Generalizability = (normalized entropy on training set) / (normalized entropy on validation set), which quantifies how well the trained representation generalizes to a held-out validation set. In Table 3 we observe a sharp drop in generalizability from 9 to 10 clusters, suggesting that 9 is the optimal cluster number: the digits 4 and 9 are written very similarly and are therefore merged into one cluster by DEC. Nine clusters also give the best NMI score in Table 2, and this corresponds well to our results in Fig. 3(a).

Table 3: Generalizability versus number of clusters.
Number of clusters | 3    | 5    | 8    | 9    | 10   | 15   | 20
Generalizability   | 0.84 | 0.82 | 0.79 | 0.77 | 0.70 | 0.66 | 0.59
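To make the generalizability metric concrete, the following sketch shows one way to compute it. The rebuttal does not spell out which quantity is entropy-normalized, so this sketch assumes it is the average entropy of the soft cluster assignments q, divided by log(k); the helper names are hypothetical.

```python
import numpy as np

def normalized_entropy(q, eps=1e-12):
    """Average per-point entropy of soft assignments q (n x k), divided
    by log(k) so the value lies in [0, 1] for any cluster number k.
    NOTE: interpreting 'normalized entropy' as soft-assignment entropy
    is an assumption made for this sketch."""
    k = q.shape[1]
    h = -(q * np.log(q + eps)).sum(axis=1).mean()
    return h / np.log(k)

def generalizability(q_train, q_val):
    """Ratio of normalized entropy on the training split to that on a
    held-out validation split; a sharp drop as the cluster number grows
    signals that the extra clusters do not generalize (cf. Table 3)."""
    return normalized_entropy(q_train) / normalized_entropy(q_val)
```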