We thank the reviewers for their insightful comments. Please find our response below.

[Assigned Reviewer 1]

[Q1] From the empirical perspective, is there any way to eliminate this repartitioning step, while still retaining very good empirical performance?

[A1] The random partitioning is applied to guard against the worst case. As the reviewer correctly points out, a simpler strategy that reduces communication cost should suffice in practice. For instance, consider the following strategy in which the random partitioning is computed only once, before the first round.

1
|\
1 3
|\ |\
1 2 3 4

In the above tree with 4 machines, in the first round machine 2 sends its output to machine 1, and machine 4 sends its output to machine 3. In the second round machine 3 sends its output to machine 1. While we cannot provide theoretical guarantees for this strategy, it should work well in practice.

[Q2] It may also be interesting to investigate an even more restricted scenario: mu < k.

[A2] We note that submodular evaluation oracle calls on sets of size greater than mu are not feasible in this setting (or at least their cost would have to be formally modeled). With this restriction the best approximation guarantee is mu/k, which could be small if mu << k. Nevertheless, it is an interesting direction for future research.

[Assigned Reviewer 2]

[Q1] What is G_k and when would it be non-monotone?

[A1] G_k (as defined by Kumar et al.) is a function that takes a set A and returns a subset of A with at most k elements. Furthermore, it has to be monotone: if A is a subset of B, then G_k(A) must be a subset of G_k(B). Two important algorithms induce a non-monotone G_k: classic greedy and stochastic greedy. This implies that greedy cannot be used as a sub-procedure of the Sample&Prune algorithm.

[Q2] Comparison to GreedyScaling and ThresholdMR of Kumar et al.

[A2] We strongly believe that our approach and the approach of Kumar et al. are complementary and that both have merit. Kumar et al. provide a more general framework and show that it can also be applied (rather elegantly) to several submodular maximization problems. Since the main goal of Kumar et al. was not to handle the case of limited capacity, it is perhaps not entirely fair to compare their framework directly to our approach. Here is why.

Firstly, notice that each call to Sample&Prune (page 4, Kumar et al.) requires at least mu = max(|X|, |M_s|) space. Hence, by Lemma 6, this implies that mu >= log(n) * sqrt(2nk) with high probability. Given this capacity, our algorithm always terminates in two rounds, and we empirically observe approximation ratios very close to one.

Secondly, to ensure monotonicity in the ThresholdMR (and GreedyScaling) algorithm one has to:
a) estimate the maximum marginal gain Delta;
b) introduce (log Delta)/epsilon different thresholds;
c) invoke the two-round map-reduce Sample&Prune procedure in parallel *for each threshold*.
Hence, the number of oracle calls is at least (n log k)/epsilon. With GreedyScaling the situation is no better, since log(k) is replaced with log(Delta). In contrast, with classic greedy as the compression algorithm we get O(nk) oracle calls, and in practice, using stochastic greedy yields O(n log(1/epsilon)) oracle calls.

Critically, to execute the ThresholdMR algorithm in parallel one needs n * (log Delta)/(mu * epsilon) machines. Since we are primarily interested in small approximation errors, running this algorithm within our current experimental setup is not feasible. In contrast, our algorithm requires n/mu machines, which is optimal.
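To make this gap concrete, here is an illustrative back-of-the-envelope calculation; the numbers are hypothetical and chosen by us purely for illustration, not taken from the paper. Take n = 10^6, epsilon = 0.1, Delta = 10^3, and mu = 10^4. Then ThresholdMR needs roughly n * (log Delta)/(mu * epsilon) = (10^6 * 10) / (10^4 * 0.1) = 10^4 machines (using log Delta ~ 10 for log base 2), whereas our algorithm needs n/mu = 10^6 / 10^4 = 100 machines, i.e., a factor of 100 fewer.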
As a consequence, both the overall computational cost and the implementation effort are greatly reduced. We also note that we can use the thresholding-based algorithm as a compression subprocedure; on the other hand, Sample&Prune cannot use classic greedy as a pruning procedure.

[Q3] What exactly is the relative approximation error shown on the vertical axis in Figure 2?

[A3] Thank you for noticing this typo. It is indeed the approximation ratio with respect to the centralized solution (as stated in Table 3).

[Q4] Do you have any intuition as to why the stochastic methods work well on WEBSCOPE but not as well on the TINY dataset?

[A4] Our intuition is that the variance introduced by stochastic greedy is significantly higher in the context of exemplar-based clustering.

[Q5] Table 3: What are the values of the column headers \mu_1, \mu_2, and \mu_3?

[A5] They are \mu_1 = 200, \mu_2 = 400, and \mu_3 = 800, respectively.

[Assigned Reviewer 3]

Please refer to our response to Question 2 of Assigned Reviewer 2.
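As a small addendum to our answer to Q1 of Assigned Reviewer 2, here is a toy counterexample (ours, purely illustrative) showing that classic greedy induces a non-monotone G_k. Consider the monotone modular (hence submodular) function f on the ground set {a, b} with f({a}) = 1 and f({b}) = 2, and let k = 1. For A = {a}, greedy returns G_1(A) = {a}; for B = {a, b}, which contains A, greedy returns G_1(B) = {b}. Since {a} is not a subset of {b}, the monotonicity requirement on G_k fails even though A is a subset of B.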