Paper ID: 1332
Title: Horizontally Scalable Submodular Maximization

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper presents a multi-round distributed algorithm for submodular maximization subject to a cardinality constraint k. In r rounds, the algorithm achieves a 1/(2r) approximation using roughly n^{1-1/r}/k machines, each with memory k*n^{1/r}. Similar guarantees are also shown for other types of constraints for which the greedy algorithm works well.

Clarity - Justification:
The presentation is clear.

Significance - Justification:
The theoretical contribution seems small, as it is largely an adaptation of previous works. However, the experimental results are good, and the algorithm is simple and either competitive with or better than previous approaches in all dimensions: number of machines, work per machine, and number of rounds.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The algorithm and the proofs are natural extensions of previous works by Mirrokni & Zadimoghaddam '15 and Barbosa et al. '15. In the previous work, the elements are distributed among machines, each machine computes a solution of size k, the solutions are aggregated on a single machine, and a final solution is then computed. The new algorithm is the natural extension to multiple rounds: in the second round, the solutions of the first round are randomly distributed to a (smaller) number of machines, each machine computes a solution from the elements assigned to it, and the process continues until the number of remaining elements is small enough to fit on a single machine.

The proof is similar to the previous proofs. Previously, it was shown that the sum of the expected solution values in the first and second rounds is at least a fraction of OPT, which leads to a factor-1/2 loss compared with the centralized solution. In the new proof, it is shown that the sum of the expected solution values over all rounds is at least the same fraction of OPT, which leads to a factor-1/r loss compared with the centralized solution.

Compared with the previous work by Kumar et al. '13, the number of rounds is smaller by a factor of log(Delta), but at the cost of a much weaker approximation ratio (a constant versus 1/(2r), which degrades with the number of rounds). The guarantee seems too weak to be interesting, since we know from Kumar et al. that a constant approximation is possible with a deterministic algorithm, while the guarantee here is probabilistic and also degrades with more rounds. However, the experimental results are good and suggest the algorithm might perform much better in practice.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper studies how to scale up submodular maximization horizontally. In contrast to existing distributed algorithms, which typically assume \mu >= \sqrt{nk} (where \mu is the capacity of a machine, n is the ground set size, and k is the cardinality constraint), this work considers an interesting scenario where this assumption no longer holds. To deal with this scenario, they design a multi-round compression framework where, in each round, they first partition the data across some number of machines, with each machine receiving about \mu data items, then obtain a size-k summary from each machine, and lastly use the union of these summaries as the ground set for the successive round.
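A minimal sketch of this framework as I understand it, in illustrative Python (the helper names, the toy greedy subroutine, and the simplifying assumption \mu >= 2k are mine, not the authors'):

import random

def greedy(f, items, k):
    # standard greedy: repeatedly add the element with the largest marginal gain
    S = []
    for _ in range(min(k, len(items))):
        candidates = [e for e in items if e not in S]
        if not candidates:
            break
        S.append(max(candidates, key=lambda e: f(S + [e]) - f(S)))
    return S

def tree_compress(f, V, k, mu):
    # multi-round compression: repartition, summarize, and take the union of the
    # summaries as the next ground set, until everything fits on one machine
    current = list(V)
    while len(current) > mu:
        random.shuffle(current)                          # random (re)partitioning
        chunks = [current[i:i + mu] for i in range(0, len(current), mu)]
        summaries = [greedy(f, chunk, k) for chunk in chunks]   # one summary per "machine"
        current = [e for s in summaries for e in s]      # union of the summaries
    return greedy(f, current, k)                         # final solution on one machine

With \mu >= 2k, each pass shrinks the remaining pool by roughly a factor of \mu/k.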
They show that the number of rounds of this scheme is bounded logarithmically, and that it obtains a 1/(2r)-approximation guarantee, where r is the number of rounds. They empirically evaluate this framework on a number of tasks, and the proposed method appears to yield solutions close to those of the centralized greedy algorithm even for extremely small capacities \mu. I think the paper makes an interesting and important contribution toward further scaling up submodular maximization, despite the large body of existing work on this topic. Though the presentation of the experimental results could be improved, I think the paper should be accepted.

Clarity - Justification:
The paper is clearly written and easy to follow. I enjoyed reading it.

Significance - Justification:
The paper considers a novel scenario for distributed submodular maximization and introduces a simple multi-round optimization procedure, for which a theoretical analysis is given. The restricted scenario considered in this work is of great interest for extremely large-scale applications.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Some comments and questions:
(1) It seems that the proposed framework works in the scenario \mu > k. It may also be interesting to investigate an even more restricted scenario, \mu < k; perhaps a drastically different scheme would be needed to address it.
(2) The data partitioning step needs to be performed in each round, which may incur significant computational overhead, especially when the number of rounds is large. I guess repartitioning the data every round is needed for the theoretical analysis. From an empirical perspective, is there any way to eliminate this repartitioning step while still retaining good empirical performance?

Minor suggestions:
(1) In the description of Algorithm 1, add the statement "A_0 <- V" before line 280.
(2) In Figure 2, do not plot the random baseline, so that the comparison among Greedy, Tree, and RandGreeDI is more visible.
(3) In Figure 2F, instead of using LAZY A and LAZY B in the legend, it would be better to refer to them as LAZY with capacity 0.05% and 0.1%.
(4) In line 742: "necessairy" -> "necessary".
(5) It would also be more complete to show some runtime analysis of the proposed method. It would be interesting to know how the runtime of the algorithm scales with the capacity.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
Recently there has been a lot of work on developing distributed algorithms for monotone submodular maximization, subject to cardinality (or sometimes more complex) constraints. However, all distributed algorithms implicitly assume that there exists at least one computer with storage capacity that grows with the size of the ground set. More concretely, if n is the size of the ground set of items and k is the cardinality bound, then all of these algorithms (except those of Kumar et al. (2013)) assume a machine with storage capacity >= \sqrt{nk}. This is impractical for many web-scale datasets. This work proposes a practical solution, Tree-Based Compression, outlined in Algorithm 1. For a fixed machine capacity, the data is first randomly partitioned across however many machines are necessary to store it. Then, a solution of size at most k is computed on each machine. These solutions are then themselves randomly partitioned across however many machines are necessary to store them.
(This will be a smaller number of machines than were needed for computing the original solutions.) Again, each machine computes a solution set of size at most k, and these solutions are partitioned across machines. This process continues until all the remaining solutions fit on a single machine. The authors show (Theorem 3.3) that as long as the machine capacity \mu is greater than k, Tree-Based Compression has an approximation guarantee of 1/(2(\log_{\mu/k}(n/\mu) + 1)). The term \log_{\mu/k}(n/\mu) + 1 in the denominator is a bound on the number of rounds that the algorithm has to go through (see Proposition 3.1). The algorithm is also shown to have a guarantee in the presence of more complex constraints (see Theorem 3.5). The paper concludes with a demonstration of the algorithm's performance on several large-scale real-world datasets.

Clarity - Justification:
The paper is generally well-written, with sufficient explanation in the proofs to make them understandable. Please see the "detailed comments" section for a few remarks on organization and questions about things that didn't seem to be fully explained.

Significance - Justification:
The work that is most similar to this one in terms of the machine sizes it requires is Kumar et al. (2013). That work's GreedyScaling and ThresholdMR algorithms only require O(n^{\delta} k \log n) storage space per machine. While it is true that this grows with n (whereas the capacity required by the Tree algorithm does not depend on n at all), the growth should be quite slow. Hence, these algorithms seem like they should be practical even for the largest datasets experimented on in this work. Unfortunately, it is somewhat hard to judge whether the Tree algorithm is more practical than these, as none of the experiments use GreedyScaling or ThresholdMR. (Is there a reason why these are not compared against?) Still, this paper makes substantial enough theoretical contributions, as mentioned in the summary section above, that it should probably be published regardless.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Review summary: This paper is fairly clearly written and the theoretical contributions seem significant. If the authors can adequately justify in their response the lack of empirical comparison to the methods from Kumar et al. (2013), then this paper probably deserves a "strong accept".

Other notes:
Line 193: "greaten" should be "greater".
Line 186: G_k isn't defined in this work at this point. It would be good to have a couple of sentences here giving intuition as to what G_k is and when it would be non-monotone. Is the potential non-monotonicity of G_k the reason why neither of the Kumar et al. (2013) algorithms is used in the experiments?
Line 367: "a set S of size with" should be "a set S of size <= k with".
Experiments: The organization of this section is a bit confusing. Maybe just describe all of the datasets up front instead of interspersing them with the tasks?
Figure 2:
1) What exactly is the relative approximation error shown on the vertical axis in Figure 2? Is it the ratio of the f-value returned by a given algorithm to the f-value of the centralized greedy solution? Regardless, it seems a little strange to call it "error", since usually minimizing error is considered a good thing, whereas in these plots lower values are worse.
2) Do you have any intuition as to why the stochastic methods work well on WEBSCOPE but not as well on the TINY dataset (plot F in Figure 2)?
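Theorem 3.3: As a sanity check on how I am reading the bound, a quick back-of-the-envelope computation (the concrete values of n, k, and \mu are my own, purely illustrative):

from math import log

n, k, mu = 10**9, 100, 10**5             # ground set size, cardinality bound, machine capacity
rounds_bound = log(n / mu, mu / k) + 1   # round bound as in Proposition 3.1
approx = 1 / (2 * rounds_bound)          # guarantee as in Theorem 3.3
print(rounds_bound, approx)              # ~2.33 and ~0.21

So for these (made-up) values the guarantee is roughly a 0.21-approximation, obtained with per-machine memory four orders of magnitude smaller than the ground set.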
Table 3: What are the values of the column headers \mu_1, \mu_2, and \mu_3?
Line 640: "we can approximated it" should be "we can approximate it".
Line 847: "allows to solve" should be "allows us to solve".

=====