We thank the reviewers for the insightful comments and suggestions. We go through them in detail below.

Review 1:

- "Compare to results in http://arxiv.org/pdf/1504.06729v2.pdf"
The model in that paper is indeed closely related, and we will add a discussion in the final version. As in our setting, the columns are split across s machines. The main differences are the following. If each column has sparsity \phi, that paper gives an algorithm for low-rank approximation with total communication O(sk\phi/\epsilon + sk^2/\epsilon^4); their algorithm is more intricate than GREEDY and works even for a "worst case" partitioning of the columns into machines. Our algorithm is very simple, and for a random partitioning, the communication is just the first term above, along with an extra \sigma_min(OPT) term. Thus, depending on \sigma_min and \epsilon, each of the two bounds can be better than the other.

Review 2:

- "Statement of Theorem 3 is a bit underspecified. The total # of columns is \kappa(OPT)/\eps \times k/\sigma_{\min}(OPT).."
Yes, the distributed version picks a \kappa(OPT) factor more columns than GREEDY on a single machine; we will make this clear in the final version. This appears to be the tradeoff that comes with parallelization (at least in our analysis -- see below).

- "Is this the best possible result for the distributed setting?"
We do not know of a stronger lower bound for the distributed setting than the one for the single-machine greedy algorithm. It is an interesting open question (and a tricky one, because such a bound would have to hold for a random partitioning).

- "For the experiments, I would prefer to see results on the larger dataset as opposed to the smaller one. Figure 1 is on MNIST right? On the larger scale experiment, what is the performance of the 2-Phase baseline?"
Yes, Figure 1 is for MNIST; the goal of that experiment is to show that our proposed methods lose little accuracy relative to GREEDY while gaining a significant speedup over it. The 2-Phase baseline is indeed faster on MNIST, but this is because MNIST is still in the relatively small-data regime. On the large-scale dataset (news20.binary), the 2-Phase algorithm is simply not able to run, since it depends on computing a top-k SVD, which is intractable for n = 100,000.

Review 3:

- "The paper improves upon previous work on the theoretical analysis of the greedy algorithm for CSS. However, the improvement of the bound seems not significant."
First, we note that our bound depends only on the condition number of the set of optimal vectors (OPT_k), whereas the state-of-the-art bounds depend on the condition number of the entire set of vectors B. In practice, one expects the former to be very well conditioned, whereas the latter could be (and for real data likely will be) poorly conditioned. Moreover, our bounds on GREEDY are tight, as we mention. Second, the analysis of GREEDY is only one of the theoretical contributions of this paper. The analysis of the distributed algorithm via ideas from submodular maximization, and the significance of random partitioning, are, in our opinion, also novel.

- "In line 255, do we need to make sure that previously selected columns in B cannot be chosen again?"
In principle, a column may be chosen twice. However, the second time it would have zero marginal utility, so this case would only arise if the data has already been completely explained (in which case it is pointless to choose more columns anyway).

- "In line 301, should it be k instead of k'?"
Yes, that is a typo -- we will fix it. Thanks for catching it.

- "One needs to choose r columns (r > k) using the greedy algorithm so as to be close to the optimum with k columns. This is really not bounding the performance when restricted to only output k columns. Is there a way to modify the analysis?"
This is a great question, and experimentally we do see good performance even when restricted to only k columns. Complexity-wise, however, it turns out that GCSS is harder than the so-called "max-coverage" problem, which implies that in general, if we restrict to k columns, it is impossible to approximate to a ratio better than (1 - 1/e) ~= 0.63 unless P = NP. In the practical settings that motivated our study, the goal is to explain as much of the data as possible using a small number of columns, so obtaining a (1-\epsilon) approximation while choosing more (but not many more) columns was the important objective.

We will make sure to make all these responses clear in the paper.
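To make the algorithmic points above concrete (greedy selection by marginal utility, a column re-selected having zero marginal gain, and a random partitioning of the columns across machines), here is a minimal, self-contained Python sketch. It is not the paper's exact GREEDY/GCSS implementation; the function names, the aggregation round at the coordinator, and all parameters are illustrative assumptions.

# Sketch only: greedy column subset selection with a random partition across
# "machines", followed by a greedy aggregation round. Names are hypothetical.

import numpy as np

def coverage(A, S):
    """Frobenius mass of A explained by projecting onto the span of columns S."""
    if len(S) == 0:
        return 0.0
    Q, _ = np.linalg.qr(A[:, S])          # orthonormal basis for the selected columns
    return np.linalg.norm(Q.T @ A, 'fro') ** 2

def greedy_css(A, candidates, r):
    """Pick up to r columns from `candidates`, each maximizing the marginal gain.

    A column already in the span of the selection has ~0 marginal utility, so it
    is only re-selected when nothing is left to explain (the point in Review 3)."""
    S = []
    for _ in range(r):
        gains = [coverage(A, S + [j]) - coverage(A, S) for j in candidates]
        best = int(np.argmax(gains))
        if gains[best] <= 1e-12:          # data fully explained; stop early
            break
        S.append(candidates[best])
    return S

def distributed_greedy_css(A, r, s, rng):
    """Randomly partition the columns across s machines, run GREEDY locally,
    then run GREEDY once more over the union of the local selections."""
    n = A.shape[1]
    parts = np.array_split(rng.permutation(n), s)         # random partitioning
    local = [greedy_css(A, list(p), r) for p in parts]    # one local round per machine
    pooled = [j for sel in local for j in sel]
    return greedy_css(A, pooled, r)                       # aggregation at the coordinator

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((50, 200))
    S = distributed_greedy_css(A, r=10, s=4, rng=rng)
    print("selected:", S, "fraction explained:",
          coverage(A, S) / np.linalg.norm(A, 'fro') ** 2)

In this sketch, only the locally selected column indices (and, in a real system, the corresponding columns) would need to be communicated, which is what keeps the communication proportional to the number of selected columns per machine.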