We thank all reviewers for their thoughtful comments.

Reviewer 1 and Reviewer 2: We will improve our paper based on these suggestions. We will add more details on the Arthur/Vassilvitskii and Chawla/Gionis algorithms, discuss the stability and robustness of our algorithm, and correct the typos.

Reviewer 3: We feel strongly that our work has not been treated fairly and seriously by the reviewer, and we would like to take this opportunity to clarify possible misunderstandings.

The reviewer judged that our algorithm can be easily dismissed as a “baseline with poor novelty”. However, we have not been able to find any previous work on communication-efficient k-means clustering with distributed dimensions. An obvious baseline is to communicate all information to a central server and run a centralized algorithm, which costs O(nd) total communication, whereas our algorithm costs only O(Tn), a significant saving in the typical case where d is much larger than T.

The reviewer claimed that our algorithm “clearly does not work well” for irregular dimensions. First, we want to point out that our algorithm is not responsible for the partition of the dimensions. We handle the case where the partition is already predetermined (we explicitly and extensively discussed such application scenarios in the introduction), and the goal is to cluster over such distributed data. Moreover, our algorithm can also handle irregular dimensions: the algorithm is exactly the same, and the analysis and approximation guarantees are unchanged.

The reviewer also raised the question of network failures. The focus of our work is algorithm design, not the systems aspect. Our algorithm requires only one or two rounds of simple data communication and can easily be implemented on top of any modern distributed data processing system, such as Hadoop or Spark, leaving network failures to the system. Even if such a system is not readily available, we can simply pack the memberships into a file (or a few small files) and transmit it to the central server using any standard protocol (e.g., TCP/IP).

The reviewer questioned the size and dimensionality of the datasets, as well as reproducibility. The datasets we used are fairly standard and have been used in previous related work. Our algorithms (in particular the first one) are very easy to implement, and we are happy to make the simulation code public (after the work is published).

Reviewer 7: The main focus of our paper is to optimize the total communication cost. As the reviewer pointed out, saving memory capacity and computation cost are important motivations, with which we absolutely agree. In fact, balancing the computation load is a by-product of our algorithm: each party only needs to compute a local k-means, and the central server only needs to compute a k-means over the weighted grid constructed by the algorithm, which is small. Also note that although our theoretical analysis assumes T is a constant, our algorithm (Algorithm 1) works for any T (in our experiments, we choose T=100 for the second dataset). When T is so large that k^T > n, we only construct the nonempty grid points (see Remark 1 in the paper); the approximation ratio and the communication cost are still the same as reported in Table 1.
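For concreteness, here is a minimal sketch (in Python) of the two-round structure described above. The function names are illustrative and not taken from our code, and the sketch assumes that each grid point is formed by concatenating one local centroid per party; it is meant only to convey the division of labor, not to reproduce Algorithm 1 exactly.

```python
# Minimal sketch of the two-round protocol described above; names and the
# grid-point construction (concatenated local centroids) are assumptions
# of this sketch, not necessarily the paper's exact implementation.
from collections import Counter
import numpy as np
from sklearn.cluster import KMeans

def local_step(X_t, k):
    # Party t clusters its own dimension block and ships back its n
    # membership labels (O(n) per party) plus its k small local centroids.
    km = KMeans(n_clusters=k, n_init=10).fit(X_t)
    return km.cluster_centers_, km.labels_

def central_step(centroids, labels, k):
    # Weight of grid point g_{i_1,...,i_T} = number of points whose
    # membership vector is (i_1,...,i_T).  Only nonempty grid points are
    # built, so at most n of them exist even when k^T > n (cf. Remark 1).
    weights = Counter(zip(*labels))
    cells = list(weights)
    grid = np.array([np.concatenate([c[i] for c, i in zip(centroids, idx)])
                     for idx in cells])
    w = np.array([weights[idx] for idx in cells], dtype=float)
    # The server's k-means runs on the small weighted grid, not on the data.
    return KMeans(n_clusters=k, n_init=10).fit(grid, sample_weight=w)

# Usage: T = 3 parties, each holding a disjoint block of the d columns.
X = np.random.rand(1000, 60)
cents, labs = zip(*(local_step(B, k=5) for B in np.array_split(X, 3, axis=1)))
model = central_step(cents, labs, k=5)
```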
We suspect that the reviewer may assume that the whole dataset is initially stored at the central server; in question (5), the reviewer mentioned that “This is needless, since the central server can compute…..”. In our setting, the central server does not hold the whole dataset. As we argued in the introduction, in many application domains the data partition is already in place, and our algorithm starts from that partition.

Detailed comments (1) and (2): We apologize for the complicated notation; we will provide a detailed verbal explanation in the next version. Let us give a simple example to clarify our algorithm. Consider T=3 and k=5, and suppose point A is in the 1st cluster at party 1, the 3rd cluster at party 2, and the 5th cluster at party 3 (note that in our distributed-dimension setting, every point appears at every party). Then point A is associated with the weighted grid point g_{1,3,5}, whose weight is |M^1_1 \cap M^2_3 \cap M^3_5|. Clearly, every point is associated with such a weighted grid point (indexed by the T-dimensional vector of its memberships at the different parties), hence the union of all these [k]^T intersections covers all n points; a tiny numerical illustration is given at the end of this response.

Detailed comment (3): The value of T only affects the communication complexity, not the approximation ratio. In fact, in all prior work listed in Table 1, the approximation ratios are independent of T. Roughly speaking, the weighted grid we construct can be thought of as an approximation of the original point set; the approximation ratio is determined by the quality of this approximation, which does not degrade as T grows.
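To complement the example in detailed comments (1) and (2): with hypothetical membership labels for n=6 points (chosen only for illustration), counting identical membership vectors gives exactly the grid weights, and g_{1,3,5} receives weight |M^1_1 \cap M^2_3 \cap M^3_5|.

```python
from collections import Counter

# Hypothetical memberships for n = 6 points at T = 3 parties
# (1-based labels, k = 5, matching the example in the text above).
labels = [
    [1, 1, 2, 4, 1, 3],   # party 1: point A (index 0) is in cluster 1
    [3, 3, 5, 2, 3, 1],   # party 2: point A is in cluster 3
    [5, 5, 1, 2, 5, 4],   # party 3: point A is in cluster 5
]
weights = Counter(zip(*labels))

print(weights[(1, 3, 5)])            # 3 = |M^1_1 \cap M^2_3 \cap M^3_5|
assert sum(weights.values()) == 6    # the grid points cover all n points
```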