We thank all reviewers for their careful reviews. We address some questions/comments below.

Reviewer 4

- Grouping words; discussion of the parameter of interest. Yes, one key insight of our paper is that we can use the sparsity of the topic vector to reduce the sample complexity. In typical scenarios we can think of k = 100 and r = 3. Suppose we want an accuracy of eps = 0.1. With our algorithm, in the best case we would need roughly r^2/eps^2 = 900 words; if we had to depend on k, even in the best case we would need at least k/eps^2 = 10000 words, which is very impractical. Your "grouping" idea is in fact very similar to our inverse matrix B: in some sense, we use a linear program to compute the optimal way of grouping words so that they represent one particular topic.

- Greedy algorithm using likelihood for selecting topics. Choosing a topic based on likelihood is likely to favor a "broader" topic. A more specific topic pays a large penalty on words that simply do not appear in that topic (words that should be explained by other topics), while a broader topic achieves a much better likelihood. For example, consider a case with three groups of words: topic 1 is uniform on groups 1 and 2, topic 2 is uniform on groups 2 and 3, and topic 3 is uniform on all three groups. This example has a condition number of 3 under our definition, so our theorem applies. Yet even if the document is a uniform mixture of topics 1 and 2, topic 3 is still the topic with the highest likelihood (a small numerical sketch of this example is appended at the end of this response).

Reviewer 6

- Comments on experiments; performance is worse than Gibbs sampling. We strongly suspect the provable algorithm can be improved to match Gibbs sampling; this is only a first cut.

- The algorithm cannot trade better results for slower running time. Correct, but we hope the current analysis can be extended to analyze a range of algorithms that handle the situation described by the referee.
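
Appendix to the response to Reviewer 4: a minimal numerical sketch of the three-topic likelihood example. The group size (10 words per group), document length (500 words), and smoothing constant are arbitrary illustration choices and are not taken from the paper.

```python
# Sketch: greedy single-topic likelihood favors the broad topic 3,
# even when the document is a uniform mixture of topics 1 and 2.
import numpy as np

n = 10                          # words per group (hypothetical size)
V = 3 * n                       # vocabulary size
topic1 = np.r_[np.full(2 * n, 1.0 / (2 * n)), np.zeros(n)]   # uniform on groups 1-2
topic2 = np.r_[np.zeros(n), np.full(2 * n, 1.0 / (2 * n))]   # uniform on groups 2-3
topic3 = np.full(V, 1.0 / V)                                 # uniform on all groups

rng = np.random.default_rng(0)
doc_dist = 0.5 * topic1 + 0.5 * topic2        # uniform mixture of topics 1 and 2
doc = rng.choice(V, size=500, p=doc_dist)     # sample a document of 500 words

eps = 1e-12                                   # smoothing so log(0) stays finite
for name, t in [("topic 1", topic1), ("topic 2", topic2), ("topic 3", topic3)]:
    log_lik = np.log(t[doc] + eps).sum()
    print(name, log_lik)

# Topics 1 and 2 each assign probability ~0 to the roughly 25% of the words
# falling outside their support, so they pay a huge log-likelihood penalty,
# while the broad topic 3 attains the highest single-topic likelihood even
# though the document contains no contribution from it.
```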