Paper ID: 1015
Title: Learning Sparse Combinatorial Representations via Two-stage Submodular Maximization

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper studies a two-stage submodular maximization problem. Given m different submodular objectives, the goal is to find a subset of size \ell that maximizes the sum, over the m objectives, of the value of the best size-k subset as evaluated by each objective. The authors show that this combinatorial optimization problem cannot be formulated as simple submodular maximization. They propose two optimization frameworks: (1) a continuous greedy algorithm and (2) a local search algorithm. They claim that both frameworks admit constant approximation factors for the two-stage submodular maximization problem. In my opinion, the paper is not yet ready for publication for the following reasons: (1) The motivation of the studied problem is a bit contrived. (2) The paper lacks a number of crucial details needed to understand the claims made by the authors. (3) One of the main results, Theorem 3, does not seem to make sense: the bound becomes -1-1/e when \epsilon = 0.

Clarity - Justification:
The paper lacks a number of crucial details needed to understand the claims made by the authors. For instance, it is unclear how the continuous relaxation problem may be solved via LP when the objective is the coverage function in Section 3.1. Theorem 3 does not discuss the running time in terms of \epsilon; it should show the tradeoff between the approximation guarantee and the running time as a function of \epsilon. In Section 4, what is the definition of the potential function? In Line 453, what is meant by the piece-wise linear relaxation of f_j? I think L_j, by definition, should be the continuous relaxation of f_j. Also, in Line 461, it is unclear how the piece-wise linearity enables computing the optimal solution efficiently.

Significance - Justification:
Though the studied problem is new, I am not fully convinced by its motivation. The two-stage optimization formulation could be seen as a discrete analog of dictionary learning, but the authors do not explain how the learned discrete dictionary may be used for practical machine learning tasks. Without a more thorough discussion of its motivation, the studied problem lacks practical relevance.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
- Though not explicitly discussed, it seems that the complexity of the local search algorithm for general submodular functions grows exponentially with k, which significantly limits its applicability.
- It is unclear why the authors propose both the continuous optimization algorithm and the local search algorithm. The authors may want to add some discussion comparing the two algorithms in terms of performance guarantees and running time.
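- For reference, the two-stage objective as I read it from the problem statement (ground set N, budgets \ell and k, objectives f_1, ..., f_m; this is my paraphrase, not a quote from the paper):

      \max_{S \subseteq N, |S| \le \ell} \sum_{j=1}^{m} \max_{T \subseteq S, |T| \le k} f_j(T)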
=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The manuscript concerns the development of methods based on submodular maximization. Such methods are an active area of research and have applications to problems such as document summarization. In particular, the authors are interested in the problem of choosing a subset S of a given ground set N such that S yields a high value when given as a ground set to a set of other submodular maximization problems f_1 through f_m. This problem is challenging because, even though f_1, ..., f_m are submodular, the overall objective function is in general not submodular, and therefore it is a priori unknown whether there is any efficient way to optimize it. Despite this, the authors show that the objective can be optimized efficiently using two methods (which apply in different circumstances): (1) a continuous optimization method based on the technique of deriving the multilinear extension of a submodular function; this method achieves an approximation ratio (relative to optimum) close to 1 - 1/e when |S| is large; and (2) a local search method that guarantees an approximation ratio close to 1/2 and applies even when |S| is small.

Clarity - Justification:
The manuscript is exceptionally clear. Notes:
- F(S) appears on lines 370 and 382 and in several other places, but I believe it is not defined until line 636.

Significance - Justification:
This manuscript proves several nontrivial results related to submodular maximization. Submodular maximization is an important, active field of research, so these results have the potential to be important for future work on developing new submodular maximization methods and for applications of submodular maximization such as document summarization.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
- What is the asymptotic running time of the continuous optimization approach?

Some potentially interesting extensions:
- What if the underlying functions have different cardinality constraints (i.e., if f_i(S_i) is subject to |S_i| <= k_i)?
- What if the underlying functions are nonmonotone?

- The algorithms described have high-degree polynomial running times. In contrast, methods for (normal) submodular maximization have quadratic running times, and furthermore can be accelerated as described in (Minoux, Optimization Techniques 1978). Is there any way to adapt these methods to run faster, perhaps at a loss of performance? (A minimal sketch of the kind of lazy acceleration I have in mind appears at the end of this review.)

EDIT: I discussed with the other reviewers, and they raised some important technical concerns with this manuscript. My rating is "Strong Accept" if these concerns are handled satisfactorily in the authors' response.
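To make the acceleration I mentioned concrete: below is a minimal sketch of Minoux-style lazy (accelerated) greedy for ordinary monotone submodular maximization under a single cardinality constraint. The names f, ground_set, and k are generic placeholders, not the paper's notation, and this is not the authors' algorithm; it only illustrates the kind of pruning I am asking whether the proposed methods could borrow.

    import heapq

    def lazy_greedy(f, ground_set, k):
        # Greedy selection with lazy re-evaluation of marginal gains (Minoux 1978).
        # Assumes f is a monotone submodular set function, so stale gains are
        # valid upper bounds on the current marginal gains.
        selected = set()
        current = f(selected)
        # Max-heap (via negated values) of upper bounds on marginal gains;
        # the index i is only a tie-breaker for the heap ordering.
        heap = [(-(f({e}) - current), i, e) for i, e in enumerate(ground_set)]
        heapq.heapify(heap)
        while len(selected) < k and heap:
            _, i, e = heapq.heappop(heap)
            gain = f(selected | {e}) - current      # refresh the stale bound
            if not heap or gain >= -heap[0][0]:
                # No remaining stale bound can exceed this fresh gain,
                # so e is the true argmax of the marginal gain.
                selected.add(e)
                current += gain
            else:
                heapq.heappush(heap, (-gain, i, e))  # reinsert with updated gain
        return selected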
=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper studies a generalization of submodular maximization in which the algorithm has to select \ell items so as to maximize the sum of the values of m different submodular functions, where each function is applied to its own maximizing subset of size k out of the \ell items. Two efficient algorithms are presented: one based on a continuous relaxation solved with the continuous greedy algorithm followed by rounding, and one based on local search. The algorithms obtain constant-factor approximation guarantees. Several experiments are reported.

Clarity - Justification:
The overall clarity is good. I would prefer much more of the analysis to be included in the body of the paper instead of 3 pages of experiments. Many of the results are stated with little explanation in the body of the paper. I did not verify all the analysis, but I do have a doubt about the proof of Theorem 3.2. This theorem seems crucial for the analysis, so I hope the authors will clear this up in their response. The authors prove Theorem 3.2 by referring to their Lemma 3.1, which shows that G(x) is a multilinear extension of g. However, Lemma 3.1 uses G(x) as defined in the first optimization problem (Eq. (1)), while downward solvability, also used in Theorem 3.2, applies to formulation (6). While both formulations lead to the same value for G(x), G is not in fact the same function in both: it is defined over different spaces. It is unclear to me how it is possible to conclude something about either formulation, since both properties, (1) multilinearity and (2) the feasible region being a downward-closed polytope, depend on the representation, which is not consistent across these two formulations.

Significance - Justification:
The presented problem and solutions seem overall interesting and useful. The authors do not explain the motivation for having k < \ell. If the goal is to have a set representing several categories, why not use all the elements in the set to represent them?

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
My rating is "weak reject" because of the problem with Theorem 3.2 that I mentioned above. If this issue is clarified I will reconsider my rating.

Comments:
- In line 142 it says that local search is for "cases in which k is small", but on line 146 it says that the local search guarantee dominates when k >= 311, which is large. Which is it? Also, I have not seen a derivation of this number in the paper or the appendix; where does it come from?
- In Lemma 3.1, x is in [0,1]^{n+nm}, not in [0,1]^n.
- How is G(x) computed during the continuous greedy algorithm? Its definition includes an expectation over exponentially many assignments.
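  If the answer is sampling, the kind of estimator I would expect is roughly the following (a minimal sketch for a generic multilinear extension F(x) = E[f(R)], where R contains each element independently with probability x_e; f, x, and num_samples are placeholders, not the paper's notation or necessarily what the authors do):

    import random

    def multilinear_estimate(f, x, num_samples=1000):
        # Monte Carlo estimate of F(x) = E[f(R)], where R contains each
        # element e independently with probability x[e].
        # f is a set function; x maps elements to probabilities in [0, 1].
        total = 0.0
        for _ in range(num_samples):
            sample = {e for e, p in x.items() if random.random() < p}
            total += f(sample)
        return total / num_samples

=====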