We thank the reviewers for their encouraging comments. We address their major questions below.

--- Assigned R4 ---

>> The term "label" to describe the feature value is a bit confusing.
We will fix this in the final version.

>> A toy example.
We will add a toy example to illustrate the algorithm in the final version.

>> First the meta-clustering and then the choice of tree model.
Indeed, this is a good suggestion. We will improve the presentation accordingly.

>> Parallel properties.
Yes, this is absolutely correct. Once the tree model is estimated, we can compute the gradient ascent path for all data points in parallel and record the modes they converge to. In the end, we only need a single pass over the data to assign points associated with the same mode to the same cluster (see the first sketch at the end of this section). We briefly mentioned this in the paper (Lines 348-350). Being embarrassingly parallel makes our method rather appealing compared with others such as mixture models, k-means, k-modes, ROCK, etc. We will elaborate on this in the final version.

>> Discuss the complexity of other methods.
We will address this in the final version.

>> Why and how to compare to mixture of Gaussians (MoG)?
As shown in the paper, methods assuming a continuous domain (MoG, k-means, etc.) have a computational advantage and perform reasonably well on discrete data. To apply these continuous methods, we map a variable with L possible categorical values to L binary variables; the l-th binary variable is one if and only if the original variable has value l. This maps the categorical data into a binary-valued high-dimensional domain. We further relax this domain into a high-dimensional zero-one cube by allowing each variable to take real values between zero and one (see the encoding sketch at the end of this section).

>> More explanation of the mixture of multinomials.
Yes, we will give a thorough explanation of baselines such as the mixture of products of discrete distributions (MPD), the mixture of multinomials (MMN), latent class analysis (LCA), etc.

>> Running on larger datasets.
Besides the UCI datasets, we also ran on DNA Barcoding datasets, which have up to 4000 samples and up to 1000 dimensions. Overall, our algorithm is linear in the dimension and almost linear in the data size. It is also embarrassingly parallel, as you correctly noticed. We do note that the tree-model estimation algorithm is quadratic in the dimension, which is reasonable for a model-based method.
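To make the parallel step concrete, here is a minimal Python sketch of the mode-grouping logic. The function `ascend_to_mode` and the name `cluster_by_modes` are hypothetical placeholders of ours, not the paper's code: `ascend_to_mode` stands in for the tree-based discrete gradient ascent and is assumed to map a data point to the (hashable) mode its ascent path converges to.

```python
from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor

def cluster_by_modes(data, ascend_to_mode):
    """Group discrete data points by the mode their gradient
    ascent path converges to.

    `ascend_to_mode` is a hypothetical stand-in for the paper's
    tree-based discrete gradient ascent: it takes one data point
    and returns the mode (e.g., a tuple of attribute values) that
    its ascent path reaches.
    """
    # Each ascent path depends only on its own data point and the
    # fixed tree model, so this map step is embarrassingly parallel.
    with ProcessPoolExecutor() as pool:
        modes = list(pool.map(ascend_to_mode, data))
    # A single final pass assigns points sharing a mode to one cluster.
    clusters = defaultdict(list)
    for i, mode in enumerate(modes):
        clusters[mode].append(i)
    return list(clusters.values())
```

Only the final grouping pass is sequential, and it is a single scan over the data; the per-point map step scales out across workers.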
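The categorical-to-continuous mapping described above is one-hot encoding followed by a box relaxation. A minimal sketch, assuming integer-coded inputs (the function name and the `levels` argument are our own, for illustration):

```python
import numpy as np

def one_hot_encode(X, levels):
    """Map categorical data X (n x d, integer codes in 0..L_j - 1)
    to the binary cube used by the continuous baselines.

    levels[j] is the number of categories L_j of attribute j; the
    l-th indicator of attribute j is one iff the original value is l.
    """
    n, d = X.shape
    offsets = np.concatenate(([0], np.cumsum(levels[:-1])))
    Z = np.zeros((n, int(np.sum(levels))))
    Z[np.arange(n)[:, None], offsets + X] = 1.0
    # Relaxing the entries from {0, 1} to [0, 1] gives the
    # high-dimensional zero-one cube on which MoG, k-means, etc. run.
    return Z
```

For example, with levels = [3, 2], the row (2, 0) maps to (0, 0, 1, 1, 0).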
--- Assigned R6 ---

>> Is there an objective function?
The proposed method clusters data based on the topographical landscape of the density function and does not optimize an explicit objective function. We will improve the presentation by describing first the meta-clustering approach and then the choice of the tree model (following R4's suggestion).

>> The novel aspects of the algorithm?
The contribution is three-fold: 1) we formulate discrete data clustering as the grouping of discrete data points based on the stationary points of their discrete gradient ascent paths; 2) we propose a new algorithm that efficiently computes the discrete gradient ascent using a tree-structured graphical model; the tree model strikes a balance between statistical power and computational efficiency, and the proposed algorithm computes the ascent path for each data point in parallel; 3) on the theoretical side, the method also computes topographical features of the discrete data landscape, e.g., modes and attraction basins.

>> It is hard to know what the numbers mean, and why the algorithm performs better.
Clustering is a classic problem with many existing solutions. We conducted thorough experiments against many diverse baselines to show that our method works well in practice. We used the synthetic example (Figure 2) to illustrate the rationale of our method: it identifies clusters based on the landscape of the underlying distribution. Notably, our approach makes no geometric assumptions about the clusters.

--- Assigned R7 ---

>> Concerns about the running time.
We agree, but we would like to re-emphasize that our algorithm is embarrassingly parallel (please see the response to R4 above).

>> More explanation about the synthetic dataset. What is "randomly flipped"?
We will provide more details in the final version. "Randomly flipped" means that we assign a random value to an attribute (see the sketch at the end of this response).

>> What is ACG data?
ACG is one of the DNA Barcoding datasets in [Kuksa and Pavlovic 2009]. It consists of DNA samples from the species-rich fauna of the Area de Conservacion Guanacaste (ACG) in northwestern Costa Rica.

>> Test against single-linkage methods?
ROCK can be considered a successful adaptation of the single-linkage method to the discrete domain. During its agglomerative merging process, the similarity between clusters is measured by the number of shared neighbors.

>> Report results on spectral clustering.
Yes, we will include it in the final version.
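For concreteness, a small sketch of the flipping step as we describe it above. The flip probability `p` and the uniform redraw are our illustrative assumptions, not necessarily the paper's exact corruption protocol:

```python
import numpy as np

def randomly_flip(X, levels, p, seed=0):
    """Corrupt categorical data: with probability p, an entry is
    replaced by a value drawn uniformly from its attribute's
    categories. Assumption: the redraw may return the original
    value; the paper's exact flipping scheme may differ.
    """
    rng = np.random.default_rng(seed)
    X = X.copy()
    mask = rng.random(X.shape) < p  # which entries get flipped
    for j, L in enumerate(levels):
        rows = mask[:, j]
        X[rows, j] = rng.integers(0, L, size=int(rows.sum()))
    return X
```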