We would like to thank the reviewers for helping us to further improve this work with their insightful comments. Regarding specific comments:

"Selection of initial centers" The cluster centers were fixed before performing the experiments, and all algorithms were initialized with the same centers.

"Proposition 3.1" We have not seen this proposition in another paper and therefore consider it to be original.

"On the details of the blocking procedure" An evenly spaced blocking is used in all experiments. As remarked by Reviewer 4, a smarter blocking could provide tighter bounds (note that the results in Section 3 still hold); however, it is unclear how such a smarter blocking could be found efficiently. The number of blocks needed to achieve annz(X_B) < 0.3 * annz(X) is determined by starting from a fixed initial block-vector size and iteratively reducing it until the condition holds (we will include the details in the camera-ready version; a short sketch of this procedure is given at the end of this response). In our experiments, this iterative procedure typically needed around 3-4 steps until the condition was met. Creating the block vectors in memory is very cheap compared to computing even one iteration of kmeans. The reported results in all experiments include the time to create the initial block vectors for X as well as the block vectors for the clusters.

"Questions relating to memory consumption" In Figure 2, the memory consumption is displayed relative to the input size r (not to be confused with the absolute memory consumption). The figure implies that storing the block vectors gets cheaper (relative to total memory consumption) with increasing sparsity of X, while storing the yinyang groups does not benefit from a higher sparsity of X. To see this more clearly, suppose the task is to cluster a large-scale dataset X with annz(X) = 70 and memory consumption r into k = 10000 clusters. In that case, yinyang uses 14.3r memory. Incorporating the block vectors into yinyang (effectively creating fast_yinyang) adds at most another 0.6r, which leads to a worst-case memory consumption of 14.9r (an increase of 0.6r / 14.3r, i.e. roughly 4.2%, less than 5%). The influence of the block vectors on the final memory consumption of course changes with varying k and annz(X).

"Kmeans++" Experiments combining kmeans++ with block vectors have been carried out but were not included in this work due to space limitations. Our experiments suggest that kmeans++ can benefit considerably from using block vectors.

"Low & Zheng, 2012" Low & Zheng, 2012 started off with Hölder's inequality, |x^T y| <= ||x||_p ||y||_q, showing that the max norm, obtained by setting p = infinity, q = 1 (or p = 1, q = infinity), delivers the best approximation of the dot product when using binary vectors. They then proved and demonstrated a compression only for the max-norm case. It is not possible to generate our blockification with their technique without changing the core of their compression, which is the max norm. We will add a further discussion of the differences to Low & Zheng in the camera-ready version of the paper.
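Sketch of the iterative block-size search (referenced in our answer on the blocking procedure): the snippet below is a minimal illustration, assuming a scipy.sparse CSR input, an evenly spaced blocking, and a halving schedule for the number of blocks; the function name, the initial value, and the halving step are illustrative choices, not the exact implementation used in the paper.

```python
import numpy as np
import scipy.sparse as sp

def choose_num_blocks(X, ratio=0.3, initial_blocks=256):
    """Shrink the block-vector length until annz(X_B) < ratio * annz(X).

    X is expected to be a scipy.sparse CSR matrix (samples x features).
    Returns the chosen number of blocks for an evenly spaced blocking.
    """
    X = sp.csr_matrix(X)
    n, d = X.shape
    annz_X = X.nnz / n                                    # average non-zeros per sample in X
    row_ids = np.repeat(np.arange(n), np.diff(X.indptr))  # row index of every stored entry

    num_blocks = min(initial_blocks, d)
    while num_blocks > 1:
        # Evenly spaced blocking: feature j is mapped to block j * num_blocks // d.
        block_ids = (X.indices.astype(np.int64) * num_blocks) // d
        # Build the block vectors; duplicate (row, block) pairs are merged,
        # so X_B.nnz counts the non-zero block entries.
        X_B = sp.csr_matrix((np.ones(X.nnz), (row_ids, block_ids)), shape=(n, num_blocks))
        if X_B.nnz / n < ratio * annz_X:                  # annz(X_B) < ratio * annz(X)
            return num_blocks
        num_blocks //= 2                                  # coarsen the blocking and retry
    return 1
```

Since each candidate blocking is built explicitly, the routine could also return X_B directly, so the cost of the search amounts to the few (typically 3-4) construction passes mentioned above.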