Paper ID: 231
Title: Unsupervised Deep Embedding for Clustering Analysis

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes a method for embedding a collection of points into a space (using an auto-encoder) in which they can be clustered, jointly optimizing this embedding and the cluster centers via end-to-end backprop. To do this, points are associated with a fixed number of clusters using soft assignments, and the authors propose an objective (in addition to the reconstruction cost) that encourages these soft assignments to become sharper on average, with datapoints closest to cluster centers receiving more weight.

Clarity - Justification:
Overall this is a well-written paper and easy to follow.

Significance - Justification:
On the plus side, the experiments do show some improvements over k-means and spectral clustering (which does not scale easily to large datasets). However, the idea seems incremental (mostly a combination of well-known neural network ideas) and, as discussed below, there is no clear way to assess overfitting.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
+ My biggest concern is that the problem of simultaneously clustering and learning an effectively unconstrained nonlinear embedding seems ill-posed: there is no clear way to define what a "good clustering" means if the underlying metric is allowed to change arbitrarily, since points that were originally close to each other can always be moved far apart. If clustering in an embedding space, it seems we should at least ask that the embedding function be topology-preserving; in practice, this means we need a way to ensure that the embedding preserves local neighborhood structure (small distances should stay small after mapping). It could be that the auto-encoder in this paper manages to do this accidentally, but such topology-preserving behavior seems like it would depend heavily on the network architecture as well as the initialization strategy. It would be better to explicitly encourage neighborhood preservation in the objective function rather than simply hoping that it happens. Alternatively, if all we want is a lower-dimensional embedding that preserves distances, then we might as well use PCA or random projections (which approximately preserve distances by the Johnson-Lindenstrauss theorem), which the paper currently does not try.

+ Another weakness of the proposed approach is that there is no clear way (as far as I can tell) to assess overfitting, which is related to model selection (i.e., selecting the number of clusters to use). However, there are at least a few standard ways to do this in the clustering literature. Related to this point, there is a resemblance between the objective that the authors use and the one used in variational auto-encoders (see, e.g., Kingma & Welling, 2014), in which the objective is also a combination of a reconstruction score and a KL term on the bottleneck layer. The difference is that, because variational auto-encoders are associated with a clear probabilistic model, one can evaluate the log-likelihood of a validation set and thus easily assess overfitting. It may be that what the authors are proposing (or something similar) can be recast as an instance of the VAE framework.
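To make the model-selection point concrete, below is a minimal scikit-learn sketch of two standard criteria from the clustering literature (silhouette score and GMM BIC) evaluated on a held-out split of a fixed embedding X. This is illustrative only and not part of the submission; the function name select_k, the candidate cluster counts, and the 80/20 split are arbitrary choices of mine.

```python
# Illustrative sketch: two standard model-selection heuristics applied to a
# fixed embedding X (e.g., auto-encoder features), not the authors' method.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

def select_k(X, candidate_ks=(5, 10, 15, 20), seed=0):
    X_train, X_val = train_test_split(X, test_size=0.2, random_state=seed)
    for k in candidate_ks:
        # Silhouette score on held-out points: higher is better.
        labels = KMeans(n_clusters=k, random_state=seed).fit(X_train).predict(X_val)
        sil = silhouette_score(X_val, labels)
        # A GMM provides a likelihood-based criterion (BIC: lower is better),
        # playing a role analogous to validation log-likelihood in a VAE.
        gmm = GaussianMixture(n_components=k, random_state=seed).fit(X_train)
        bic = gmm.bic(X_val)
        print(f"k={k}: silhouette={sil:.3f}, BIC={bic:.1f}")
```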
+ The proposed method (DEC) performs reasonably in the experiments, but the authors have so far only tried problems with a small number of clusters, so it is unclear whether performance would still be reasonable with more clusters; furthermore, in problems with more clusters, the number of clusters is often unknown. Using end-to-end backprop seems to lead to only modest improvements over simply taking auto-encoder embeddings and running a post-hoc clustering step. My suggestion is to use the learned feature embedding to train a supervised classifier (say, an MLP); this would tell us whether the inaccuracies are due to the clustering or to the feature embedding.

+ Another small comment: all the plots in Figure 2 have different y-axes, making the differences seem larger than they actually are. These plots should all have compatible y-axes (or at least all start from 0).

=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors propose an unsupervised method for clustering using a learned representation. Starting with features from an autoencoder and centroids initialized using k-means, the mapping and the centers are refined using gradient backprop. The method is validated on multiple datasets and its accuracy is compared with other clustering methods. Although there are similarities to other representation learning algorithms, this submission has novel aspects and is quite interesting.

Clarity - Justification:
Very clearly written. Figures and results tables are clear.

Significance - Justification:
The proposed method for embedded clustering is clearly related to other representation learning approaches, but the actual algorithm is novel. For researchers interested in unsupervised clustering, this is a significant contribution.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The proposed method bears a strong resemblance to other distance metric learning algorithms such as DrLIM (Hadsell et al., 2005) and Fei Tian et al. (2014), but it has significant differences: the centroids of the representation are explicitly specified, and the representation is iteratively refined by giving more weight to the samples with higher-confidence soft assignments to the different clusters.

Method: In order to initialize the clusters and the assignments, the authors use an incrementally trained denoising autoencoder and then apply k-means clustering in the feature space. Is this necessary? The random projections of an untrained neural network might actually be sufficient to initialize the clustering; it would be interesting to see whether this is the case.

A KL loss is used to train the features further, using a lower-temperature, normalized version of the current soft assignment vector as the target. This is a smart choice, as it preserves the influence of the other clusters. It would have been nice to see a little more discussion of this point, and perhaps experimental evidence that the KL loss works better than, for instance, a one-hot NLL criterion with a lower learning rate (see the sketch below).

Empirical results: the experiments are well chosen. It is important to separately evaluate the contribution of the autoencoder features and the contribution of the optimization for clustering, and the authors have done that by evaluating other clustering algorithms on the autoencoder features.
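For concreteness, here is a small NumPy sketch of the two kinds of targets discussed above: the lower-temperature, frequency-normalized target driving the KL loss versus a plain one-hot target. This is my own reconstruction for illustration, not the authors' code; it assumes a Student's-t soft assignment with alpha = 1 and a squared, cluster-frequency-normalized target, as I understand them from the submission.

```python
# Illustrative sketch (reviewer's reconstruction, not the authors' code):
# soft assignment q, the sharpened target p that the KL loss pulls q towards,
# and the harder one-hot alternative mentioned above.
import numpy as np

def soft_assign(Z, mu, alpha=1.0):
    # Z: (n, d) embedded points, mu: (k, d) cluster centers (Student's-t kernel).
    dist2 = ((Z[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # (n, k)
    q = (1.0 + dist2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def sharpened_target(q):
    # Square the assignments, normalize by per-cluster frequency, then
    # renormalize per sample; stays soft, so other clusters keep influence.
    weight = q ** 2 / q.sum(axis=0, keepdims=True)
    return weight / weight.sum(axis=1, keepdims=True)

def one_hot_target(q):
    # One-hot NLL-style target: all mass on the argmax cluster.
    p = np.zeros_like(q)
    p[np.arange(len(q)), q.argmax(axis=1)] = 1.0
    return p

rng = np.random.default_rng(0)
Z, mu = rng.normal(size=(6, 2)), rng.normal(size=(3, 2))
q = soft_assign(Z, mu)
print(np.round(sharpened_target(q), 3))  # sharper than q but not one-hot
print(one_hot_target(q))                 # discards all non-argmax structure
```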
The authors correctly note that in the truly unsupervised case, determining hyperparameters through validation-set testing is not possible, so they limit tuning and instead use published hyperparameters without adapting them to the different datasets. However, one issue that the authors seem to ignore entirely is the number of clusters; they apparently assume that the ground-truth number is given. This is a common assumption in clustering, but obviously unrealistic. It would be good for the paper to address this.

The authors do not use the raw images of the STL dataset but rather HOG features. It would be very nice to know whether this is needed, or whether the method can be used fully end-to-end on images. Obviously there are issues with registration in the image and gross pixel-space differences that may dominate the initial mapping, but this is still important to understand.

The evaluation of DEC on imbalanced data is nice to see. There is a sharp drop at the 0.1 level, but at least it is not catastrophic.

=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper presents Deep Embedded Clustering (DEC), an approach to clustering that uses a deep neural network (DNN) embedding. The key idea in DEC is that the data is mapped from the original space to a lower-dimensional feature space via a non-linear transform, where the mapping is learned using DNNs; DEC thus performs feature embedding and clustering jointly. DEC is evaluated on several domains (e.g., text and image clustering) and compared to several baseline algorithms. DEC is shown to perform well in speed and accuracy, and also to be robust to the values of the algorithm's hyper-parameters.

Clarity - Justification:
The exposition in the paper is clear: the problem is well formulated, and the solution (e.g., the algorithm, the empirical validation) is clearly outlined.

Significance - Justification:
The individual components of DEC are not very novel (each is based on existing work):
- Stacked auto-encoders for learning an embedding are based on previous work.
- The clustering solution is well known (KL-divergence-based clustering) and has been studied extensively in both the hard-clustering and probabilistic-clustering settings (e.g., http://www.jmlr.org/papers/volume6/banerjee05b/banerjee05b.pdf). Using a DNN stack to do such clustering is interesting but not very novel.
However, these components are combined in a novel way to yield a joint embedding and clustering algorithm, DEC.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper is somewhat weak on novelty (as mentioned above), but overall the DEC algorithm is interesting and performs well empirically on different benchmark datasets. Apart from clustering accuracy, it would be useful to report other evaluation measures (e.g., normalized mutual information) in cases where the ground-truth cluster labels are known; a small sketch of such metrics is appended after this review. The authors have demonstrated robustness to hyper-parameters; it would also be useful to know about other aspects of DEC, e.g., robustness to noise.

=====
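As a concrete illustration of the additional evaluation measure suggested in Review #3, the sketch below computes normalized mutual information alongside the usual unsupervised clustering accuracy (best cluster-to-label matching via the Hungarian algorithm). This is only an illustrative example using standard SciPy/scikit-learn utilities, not code from the submission; the helper clustering_accuracy and the toy labels are hypothetical.

```python
# Illustrative sketch: NMI and Hungarian-matched clustering accuracy for
# predicted cluster indices y_pred against ground-truth labels y_true.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    # Build a contingency matrix and find the cluster-to-label permutation
    # that maximizes agreement (the usual "unsupervised accuracy" metric).
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    cont = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cont[t, p] += 1
    rows, cols = linear_sum_assignment(-cont)            # maximize matches
    return cont[rows, cols].sum() / len(y_true)

# Toy usage: a pure label permutation should score perfectly on both metrics.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]
print(clustering_accuracy(y_true, y_pred))               # -> 1.0
print(normalized_mutual_info_score(y_true, y_pred))      # -> 1.0
```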