AR2 :
(1) A table of space requirements is a good idea; we will include one in future edits.
(2) Changing italics to bold is probably a good idea.
(3) "Why are simplified versions of the algorithms more responsive to BLAS?" In the simplified Yinyang algorithm, removing the final filter means that distances to ALL centroids in a group are computed. Because the centroids of a group are stored contiguously in memory, and the squared norms of data and centroids are pre-computed, these distances are obtained with a single matrix-vector multiplication. The final filter of the full Yinyang algorithm, on the other hand, fragments the set of distances which need to be computed. So while Yinyang reduces the total number of distance calculations, each one must be processed by BLAS via a separate vector-vector inner product call, which achieves fewer flops/second than the single matrix-vector product of the simplified algorithm. It may be possible to copy the centroids for which distances are needed into a contiguous array to better harness BLAS with the full Yinyang algorithm, but this would incur additional costs. As for simplifying Elkan, we do not think that BLAS is relevant to the speed-up. The improvement comes about because the use of inter-centroid distances does not eliminate enough data-centroid distance calculations to make it worthwhile.
(4) "Are there computational operations that can be vectorized more easily?" Yes. In the case of Yinyang, this is essentially the explanation given in (3).
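
To illustrate the contrast described in (3), here is a minimal NumPy sketch. It is not our implementation: the function names, the array shapes and the `keep` index list (standing in for the centroids that survive a final filter) are all hypothetical, and NumPy only delegates to BLAS when it is linked against one. The sketch shows that, with pre-computed squared norms, all distances in a group follow from one matrix-vector product, whereas a filtered subset requires one inner product per surviving centroid.

```python
import numpy as np


def group_distances_matvec(x, x_sqnorm, group_centroids, centroid_sqnorms):
    """Squared distances from x to ALL centroids in a group.

    Uses ||x - c||^2 = ||x||^2 - 2<x, c> + ||c||^2, so the only
    non-trivial work is one matrix-vector product.
    """
    return x_sqnorm - 2.0 * (group_centroids @ x) + centroid_sqnorms


def filtered_distances_dots(x, x_sqnorm, group_centroids, centroid_sqnorms, keep):
    """Squared distances only for the centroids listed in `keep`.

    `keep` is a hypothetical stand-in for the output of a final filter.
    Each surviving centroid costs a separate vector-vector inner product.
    """
    distances = {}
    for j in keep:
        inner = np.dot(group_centroids[j], x)
        distances[j] = x_sqnorm - 2.0 * inner + centroid_sqnorms[j]
    return distances


# Illustrative sizes only: 8 centroids in the group, dimension 16.
rng = np.random.default_rng(0)
group_centroids = rng.standard_normal((8, 16))  # rows are contiguous in memory
x = rng.standard_normal(16)

# Squared norms are computed once and reused.
centroid_sqnorms = np.einsum("ij,ij->i", group_centroids, group_centroids)
x_sqnorm = float(x @ x)

all_d = group_distances_matvec(x, x_sqnorm, group_centroids, centroid_sqnorms)
some_d = filtered_distances_dots(
    x, x_sqnorm, group_centroids, centroid_sqnorms, keep=[1, 4, 6]
)
```

Both routes give the same values for the centroids they share; the difference is only in how the arithmetic is batched, which is what determines the achieved flops/second.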
AR5 :
We were not aware of the result that the k-means problem is not NP-hard when n grows faster than exponentially in d; this is interesting. We will include it in future edits.
ALL :
A minor correction to Table 7. We have been in correspondence with the developers of the mlpack k-means implementations, and have discovered that the values in columns mlp-ham and mlp-elk of Table 7 are not fair, as we had originally installed mlpack in (default) debug mode. Our implementations are still faster than the mlpack implementations when mlpack is installed in optimised mode, but not by as large a margin. Values obtained with the correctly installed version of mlpack will, of course, be presented in future edits.