Paper ID: 292
Title: Experimental Design on a Budget for Sparse Linear Models and Applications

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): This paper looks at general experimental design, where the problem is to choose at which inputs to perform costly querying of a function such that a good estimate of the function itself can be obtained without spending too much. It specifically looks at linear regression with L1 regularization, with "costly" defined as being able to spend a total of B unit-cost queries. They provide a new algorithm for this problem. They perform experiments on some standard datasets and a big neuroimaging dataset for Alzheimer's patients.

Clarity - Justification: The paper is well written, with only some minor "density" issues in the latter stages (e.g. the Rounding subsection is a bit crammed).

Significance - Justification: The problem is a classical one, but I believe Section 3.1.1 (the more-efficiently-computable-but-still-equivalent model of the LASSO sparse ED problem) is a novel contribution, and a nice one.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): Overall, I enjoyed reading the paper. Only a few minor comments. The most major would be: in terms of experimental comparison, comparing against urand and 1-mean alone seems a bit light; surely one could've plucked out some adaptations of methods from the Andreas Krause crew or similar to test against, too? -- but I don't see this as fatal. It's especially nice how much larger the neuro dataset is than the other two standard ones. Side question: the result in Figure 3f, basically saying "you need a lot of budget to do well" -- do you think that's a product of no sampling algorithm doing well on it yet, or of the dataset itself?

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): The paper proposes two experimental design strategies for l1-regularized linear models. One is based on the Hessian of the regression problem and the other on ideas from compressed sensing. They validate their strategy on several numerical experiments.

Clarity - Justification: The paper is well written in clear English. Previous literature seems to be well acknowledged and the problem is well motivated.

Significance - Justification: While I understand the mathematical problem of experimental design, I do not have a good overview of the literature. Therefore, I cannot really judge how significant the contribution of the paper is.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): At several points in the paper, I could not fully follow the authors' reasoning. I hope the authors can clarify this for me and improve the paper accordingly:

1) The motivation via the Hessian sounds a bit vague to me. In the end, you are trying to choose the x_i such that the maximal eigenvectors of X^T X are preserved. Wouldn't that be a better motivation?

2) I do not see how the two problems in (5) and (6), respectively, are equivalent. I see that they both lead to the same objective function, but I don't see how they would yield the same solution, since the feasible sets are different.

3) I don't understand why u is a "decision variable" (line 335). To me it seems as if it is an eigenvector. In general I am not sure whether I understand the notation u \in argmax_v ... (my best guess at the intended reading is sketched below). Because of that I don't see why the problems (8) and (9) are equivalent.
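For concreteness, the reading I am assuming is the following (with M standing in for whatever symmetric positive definite matrix appears at line 335 -- my notation, not necessarily the authors'):

  u \in argmax_{||v|| = 1} v' M v   <=>   u is a unit-norm eigenvector of M associated with lambda_max(M),

since lambda_max(M) = max_{||v||=1} v' M v. Under this reading, u is determined (up to sign, or a choice within the top eigenspace) by M, rather than being an independent decision variable.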
4) Line 655: The fact that H0 is true does not imply large p-values; it implies p-values uniformly distributed on [0,1]. I also don't understand plots (i)-(l). Why are there bar plots from the top?

* Minor comments

1) Line 335 is incorrect. It should be: lambda_max = max u' M u = 1 / min u' M^{-1} u (with ||u|| = 1 in both cases).

2) Line 678, typo: "acheive"; directly below: "behabiour".

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.): The paper studies experimental design for the sparse linear model. The idea of experimental design is to select a subset of possible feature vectors (given some budget) so as to maximize some pre-defined notion of expected information gain when the corresponding dependent variables are requested (measured). This is akin to active learning, except that everything is done in a single batch. This is a classical problem in statistics going back to the early 20th century, with a very rich literature. The paper's contribution is to consider experimental design specifically for the sparse linear model (LASSO), as opposed to plain linear regression. The paper adapts the D-optimal criterion from the ED literature (the log-det of the X'X matrix) and adds additional terms to the objective to encourage LASSO to find good solutions. The first proposal is to have a subset of vectors whose covariance matrix has eigenvalues similar to those of the full set. The second proposal is to add a penalty so that X has good RIP constants -- and they use "incoherence" instead of RIP for tractability.

Clarity - Justification: While the paper's topic is quite interesting, the paper is written very carelessly, which makes it harder to read and to take seriously. There are a number of mistakes and typos in the math -- some of which can be corrected, and some others need careful explanation from the authors for this paper to be considered for ICML. I'd suggest rejecting it to give the authors more time to write it carefully. Some examples:

- Problem (14) is not a convex optimization problem (contrary to the statement in Proposition 5), since it includes {0,1} constraints on mu. (Do you mean the convex relaxation?)

- Intro: Ridge regression (and other quadratic regularizers) is not considered robust (in the sense of Rousseeuw and estimator breakdown), since large outliers can change the solution arbitrarily.

- Intro: The reference by Recht et al. has little to do with experimental design -- it's about avoiding synchronization in SGD!?

- Tikhonov regularization (adding lambda*I to the X'X term) is NOT equivalent to min-norm least squares -- that solution is obtained by solving min ||beta||^2 s.t. y = X beta (a sketch of the distinction follows this list).

- Biconvex optimization is not convex -- and in general has no guarantees (Section 3.1.1).

- Section 3, first paragraph: "Log-det and other convex functions..." Log-det is concave! You are maximizing log-det (or minimizing negative log-det, which is convex).

- "Recall that the Hessian carries most of the curvature information and the optimal value" (Sec. 3.1). What does this mean???

- "The objective function .. touches the feasible set"??? What does it mean? Isn't the objective function defined over the entire feasible set? Do you mean the level set of the objective function?

- "The eigenvalue spectrum of the Hessian is large"??? What does it mean? You probably mean many significant eigenvalues?

And many more such examples.
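To spell out the Tikhonov point above (standard definitions; the notation beta_ridge and beta_mn is mine, not the paper's):

  Ridge / Tikhonov, fixed lambda > 0:
    beta_ridge(lambda) = argmin_beta ||y - X beta||^2 + lambda ||beta||^2 = (X'X + lambda I)^{-1} X' y.

  Minimum-norm least squares:
    beta_mn = argmin { ||beta||^2 : beta minimizes ||y - X beta||^2 } = X^+ y.

  The two agree only in the limit, beta_mn = lim_{lambda -> 0+} beta_ridge(lambda); for any fixed lambda > 0 they generally differ.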
Significance - Justification: The topic is interesting, and I have not seen the formulation proposed in the paper before. I am a bit surprised that the methods are effective -- but the experimental results suggest that they are.

The authors should comment on computational complexity. Lasso regression is most interesting for large problems with many features -- yet the biggest experiment considered in the paper has 1000 subjects and 118 features, and the proposed approach takes 7 to 10 minutes to run. Can this be scaled to larger problems?

The experimental design problem has many similarities to the batch active learning problem. The baseline evaluations are pretty weak (a random subset, and an arbitrary iterative ball-removal procedure). A stronger baseline would be obtained by running k-means clustering (where the number of clusters is the allowed budget), or better yet convex exemplar clustering, e.g. in the following paper:

"A Convex Optimization Framework for Active Learning", Ehsan Elhamifar, Guillermo Sapiro, Allen Yang, Shankar S. Sastry.

The goal there is to provide a good summary of the entire set of subjects, which is closely related to the goal in this paper.

There is some literature on experimental design for the sparse linear model that the authors should cite and contrast with the current paper:

-- Steinke, Florian, Matthias Seeger, and Koji Tsuda. "Experimental design for efficient identification of gene regulatory networks using sparse Bayesian models."

-- Deng, Xinwei, C. Devon Lin, and Peter Z. G. Qian. "Designs for the Lasso." Tech. Rep. Madison, WI: University of Wisconsin-Madison, 2011.

-- Xing, Dadi, Hong Wan, Michael Yu Zhu, and Susan M. Sanchez. "Simulation screening experiments using Lasso-optimal supersaturated design and analysis: A maritime operations application."

A classical reference on optimization with orthogonality constraints is by Alan Edelman:

Edelman, Alan, Tomás A. Arias, and Steven T. Smith. "The geometry of algorithms with orthogonality constraints." SIAM Journal on Matrix Analysis and Applications 20.2 (1998): 303-353.

In the second formulation, the experimental design is used to select subjects x_i so that LASSO can find an optimal integral solution. However, LASSO has a regularization parameter which controls the sparsity. How does it interplay with the experimental design objectives?

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): Overall I feel that the topic is interesting, and given the good collection of references, (some of) the authors are very knowledgeable. However, the paper contains a number of mistakes, typos, and meaningless statements -- many of them can potentially be corrected -- but it feels that the paper was written in a rush and not double-checked. The experiments can also be improved by comparison to stronger (or rather less weak) baselines. I would recommend the authors rewrite the paper very carefully and resubmit.

Additional comments (please see previous sections for other comments):

Abstract: "we obtain tractable algorithms for this problem and also hold..." -- grammar -- "that also hold".

"The latter experiment suggests that these ideas may play a small role"???? Do you mean it's not very important? Then why write a paper? Perhaps "that these ideas play a role"..

Justify why the term u'(....)^-1 u is convex. This isn't obvious to all readers (the standard argument I have in mind is sketched below).

"The objectives of the full and reduced set behave similarly" -- what does this mean? Are they close to each other in some norm?

Section 3.1: "log-det captures the linearity and R_G captures the geometry"... How does log-det capture linearity? In what sense?
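Regarding the convexity of u'(....)^{-1} u, the argument I would expect is the matrix-fractional one (my sketch, assuming the matrix inside the inverse is positive definite on the feasible set and affine in the design variables mu, e.g. of the form sum_i mu_i x_i x_i' + epsilon*I -- this form is my guess from the setup):

  For A positive definite, the matrix-fractional function (u, A) |-> u' A^{-1} u is convex (see e.g. Boyd & Vandenberghe, Convex Optimization, matrix-fractional example); in particular, for fixed u, A |-> u' A^{-1} u is convex in A over the positive definite cone. If A(mu) is affine in mu, then mu |-> u' A(mu)^{-1} u is a convex function composed with an affine map, hence convex in mu. If this is the intended argument, it would help to state it.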
"The rounding scheme if applicable is still powerful"??? What does it mean? Do you mean that it's empirically gives good results? What happens if there are more than two fractional coordinates? Does order matter? After eqn. (13) mu --> mu^2 . It's equivalent for the binary problem, but not for convex relaxation. why is this useful? Section 4. "Optimal choices for budget is about 400"??? optimal in what sense? Wouldn't you always want to have the full dataset if the budget allows? =====