We are grateful that all reviewers appreciate our work. While the critique is mostly favorable, R4’s summary score is a bit tepid due to a few doubts. We assure R4 that the paper was written carefully and extensively revised multiple times to appeal to the broadest possible readership. All doubts are addressed below; we are confident that, after these clarifications, R4 will agree that our treatment/presentation is rigorous and that the short-term impact of this work is immediate.
%%
R2: Are (5) and (6) equivalent?
The two problems *within* (5), i.e., the penalized and constrained models, are equivalent: Lagrange multipliers give a one-to-one correspondence between lambda and tau. The same holds within (6). However, (5) and (6) are not equivalent to each other.
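A minimal numeric sketch of this correspondence, using ridge regression as a generic stand-in (not the paper's model (5); all names are ours): the norm of the penalized solution decreases monotonically in lambda, so every constraint level tau is attained by exactly one lambda.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((50, 5))
y = rng.standard_normal(50)

def ridge(lam):
    """Penalized solution b(lambda) = (X^T X + lambda I)^{-1} X^T y."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# ||b(lambda)|| decreases monotonically in lambda, so each constraint
# level tau = ||b|| corresponds to exactly one lambda.
norms = [np.linalg.norm(ridge(lam)) for lam in (0.1, 1.0, 10.0, 100.0)]
assert all(a > b for a, b in zip(norms, norms[1:]))
```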
%%
R2: Why is u a decision variable (L335)? (8) and (9) equivalent?
Here, u is the leading eigenvector of the *reduced* set, so it is itself a decision variable: the notation means that u is the optimal solution of the subproblem that finds the leading eigenvector. Since the leading eigenvector of M is the eigenvector of the smallest eigenvalue of M^-1, Models (8) and (9) are equivalent.
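A quick numeric check of the eigenvector identity behind the (8)-(9) equivalence (a generic sketch on a random positive definite matrix; `M` here is synthetic, not the paper's matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
M = A @ A.T + 5 * np.eye(5)  # symmetric positive definite

# eigh returns eigenvalues in ascending order
w, V = np.linalg.eigh(M)
u_max = V[:, -1]             # leading eigenvector of M

w_inv, V_inv = np.linalg.eigh(np.linalg.inv(M))
u_min_inv = V_inv[:, 0]      # eigenvector of smallest eigenvalue of M^-1

# same direction, up to sign
assert abs(abs(u_max @ u_min_inv) - 1.0) < 1e-6
```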
%%
R2: Why are bar plots (i)-(l) truncated at the top?
The bar plots are truncated at the top to simplify visualization (see L752-762). We will include the full plots in the supplement.
%%
R2: True that “Choosing xi such that maximal eigenvectors of X^TX are preserved”?
R2 is right (see L313-335). We will add a toy example to the supplement.
%%
R4 asks what "Hessian carries most of the curvature information and the optimal value" means?
Eigenvalues of the Hessian are called “principal curvatures” in differential geometry and play a critical role in analyzing first-order methods for (5) and (6). See Sohl-Dickstein, "Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods" (2013).
Consequently, ED-S has strong guarantees: if, as R2 notices, the spectra of the full and reduced sets are the same, then the iterates generated (and the optimal solutions) will be similar. We will rephrase.
%%
R4 asks what 1) "Large eigenspectrum of Hessian" and 2) “Objective touches feasible set” means?
1) The number of significant eigenvalues is large.
2) In Fig 1, the point where the red contours “touch” the norm ball is the optimal solution: that feasible point attains the least objective value.
%%
R4: Algorithms scalable?
Yes! This is the motivation for ED-I over ED-S. Huge datasets pose no problem: at each iteration, coordinate descent requires only the derivative in the current coordinate (computed cheaply via the Sherman-Morrison formula). We will add these details.
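A hedged sketch of the rank-one trick alluded to here (the generic Sherman-Morrison identity; the function name is ours, not the paper's ED-I code). Maintaining the inverse under a rank-one change costs O(n^2) instead of the O(n^3) of a fresh inversion:

```python
import numpy as np

def sherman_morrison_update(A_inv, u, v):
    """Rank-one update of an inverse: (A + u v^T)^{-1} from A^{-1}, in O(n^2)."""
    Au = A_inv @ u
    vA = v @ A_inv
    return A_inv - np.outer(Au, vA) / (1.0 + v @ Au)

rng = np.random.default_rng(1)
n = 6
A = rng.standard_normal((n, n)) + n * np.eye(n)
A_inv = np.linalg.inv(A)
u = rng.standard_normal(n)
v = rng.standard_normal(n)

fast = sherman_morrison_update(A_inv, u, v)
slow = np.linalg.inv(A + np.outer(u, v))
assert np.allclose(fast, slow)
```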
%%
R4: Similarities to the batch active learning. Other references?
R6: Krause’s work?
We are familiar with Krause’s and Qian’s work. While those references may *seem* to be baselines, this is actually not true: the core difference between active learning (in most settings) and our setup is that the experimenter is ***NOT allowed to acquire the yi’s on a per-query cost basis***. For example, consider the operational reality of setting up a longitudinal study in 2016: either the subjects are enrolled in the trial (and their responses will be obtained in 2018) or they are not.
With this clarification, the reviewers will see that it is non-trivial to adapt active learning methods to our setting. Qian’s work is unrelated: it is agnostic to the covariate data and is inapplicable here.
The references in our paper subtly acknowledge this, as have our discussions, over the last 12 months, with some of the authors pointed out by R4.
%%
R4: LASSO regularization controls sparsity. In ED-I, interplay with exp-design objective?
Ideally, for any choice of the LASSO parameter, feature selection on the full and reduced sets *should be the same*, since the parameter is not known at design time. We evaluated this explicitly; see Fig 2b and Fig 3a-d.
%%
R4 asks what 1) "the objectives of the full and reduced set behave similarly" and 2) "log-det captures the linearity” means?
1) means that the regression problems on the reduced and full sets are 2-norm close.
For 2), log det corresponds to D-optimality for linear regression. The objectives in ED-I and ED-S have two pieces; the one from linear regression (log det) captures linearity.
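As a small illustration of the log-det/D-optimality piece (a generic sketch; `X` is synthetic data, not from the paper): the log-determinant of the information matrix X^T X never decreases as design points are added, which is why it is a natural score for subset selection.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 5))

def log_det_information(X_sub):
    """D-optimality criterion: log det of the information matrix X^T X."""
    sign, logdet = np.linalg.slogdet(X_sub.T @ X_sub)
    return logdet

# adding rows adds a PSD term to X^T X, so the log-det can only grow
assert log_det_information(X) >= log_det_information(X[:50])
```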
%%
R4: In what sense is pipage rounding more “powerful”?
In theoretical CS, pipage rounding is used extensively to obtain approximation guarantees for many intractable problems. Even for submodular maximization (Krause’s work), pipage rounding often guarantees a constant-factor approximation (L545-555).
%%
R4: Sec 4. Optimal choices for budget is ~400? Optimal in what sense?
The change in R^2 between the reduced and full models (Fig 3f) was used to pick a “good” budget. As R6 also notes, the full dataset is always better than the reduced one; here we call the smallest budget whose R^2 change approximates that of the full model “optimal”.
%%
R6: Fig 3f says "you need a lot of budget to do well"?
Yes. For this dataset, while enrolling 400 subjects is sizable, it offers significant savings of 50%+ over a cohort size of 1000.
%%
R4: Term u'(...)^-1u convex?
This is the matrix fractional function; its convexity is shown in Boyd and Vandenberghe, page 76.
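For completeness, a numeric midpoint-convexity check of the matrix fractional function f(u, X) = u^T X^-1 u (a generic sketch on random positive definite matrices; all names are ours):

```python
import numpy as np

def matrix_fractional(u, X):
    """f(u, X) = u^T X^{-1} u, jointly convex in (u, X) for X positive definite."""
    return u @ np.linalg.solve(X, u)

rng = np.random.default_rng(3)
n = 4

def random_pd():
    A = rng.standard_normal((n, n))
    return A @ A.T + np.eye(n)

u1, u2 = rng.standard_normal(n), rng.standard_normal(n)
X1, X2 = random_pd(), random_pd()

# midpoint convexity along the segment between (u1, X1) and (u2, X2)
lhs = matrix_fractional(0.5 * (u1 + u2), 0.5 * (X1 + X2))
rhs = 0.5 * matrix_fractional(u1, X1) + 0.5 * matrix_fractional(u2, X2)
assert lhs <= rhs + 1e-9
```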