Paper ID: 1287
Title: Fast Algorithms for Segmented Regression

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes a new algorithm that learns the k-piecewise linear function in O(nd^2) time (as opposed to O(n^2 d^2) time with dynamic programming).

Clarity - Justification:
The paper is generally pretty clear. The only complaint is that I am not sure why this is called nearly linear time. The dependency on d is quadratic -- and I don't quite see why it's necessary in principle (or maybe I missed something obvious).

Significance - Justification:
It's pretty impressive that the statistical error only gets worse by a factor of 2 or 3 while the runtime can be a factor of n faster. It's really nice to identify the inefficiency of the existing DP algorithm. That said, the algorithmic idea is not super surprising/novel, given the many attempts in computer science at speeding up similar kinds of questions with approximation guarantees (instead of exact solutions). Those are purely computational questions, whereas here there is statistics involved; therefore I regard the main contribution of the paper as developing faster (improper learning) algorithms for statistical questions.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper gives a nice algorithm that runs faster than the previous best DP algorithm by a factor of n, with suboptimal statistical guarantees.

=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes an algorithm with O(n log n) running time for piecewise linear regression (with squared loss). This is a fairly natural problem. The obvious DP solution takes O(n^2 * k) time, where k is the number of pieces. This is a fixed-design regression problem (i.e., the set of x's is fixed, there is no distribution from which the x's are generated, and we do not care about generalization to other points in the space). The paper assumes that the data is generated according to a piecewise linear function corrupted by sub-Gaussian noise.

Clarity - Justification:
The paper is fairly understandable, mostly because the problem is quite easy. There is a good number of mathematical typos, mostly simple mistakes that can be easily fixed, but they make the paper look sloppy. See details in the comments below.

Significance - Justification:
Piecewise linear regression is a simple and useful problem. Having an algorithm with sub-quadratic running time is useful. The paper seems correct, but I didn't check the calculations in great detail. The proof techniques are fairly standard. The proofs are based on existing results for fixed-design linear least squares regression.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
1) Definition 1 is sloppy. The wording w.r.t. the "fixed, known j" is confusing. First, writing "for all x there exists j" is mathematically incorrect. I assume the correct wording should be "f is such that there exist theta_1, ..., theta_k, a k-partition of the reals into intervals I_1, ..., I_k, and an index j such that f(x) = (theta_i * x) if x_j lies in I_i". But it's still not clear whether j is fixed, in the sense that the family of piecewise linear functions should be parametrized by both k and j, or whether L_k is the union of these families over all possible choices of j. For example, in the piecewise polynomial regression example, j is fixed to j = 2. Please make the wording of the definition more precise.
2) Sentence "We remark that this model as contains ..." is not grammatical. 3) Section 2, the t-piecewise degree d polynomial regression: Please replace t with k. There are absolutely no reason to introduce a new letter t. (Readers have limited memory to keep all the letters in their heads.) 4) Theorems 2, 3, 4, 5 and many other lemmas: What is variable r? You probably want to replace it with d. 5) Theorems 2, 3, 4, 5. Explain the difference between s^2 and sigma^2. Consider simplifying the presentation to presentation to N(0,1). 5) Theorem 2,3,4. The algorithm needs k as an input. It's not specified in the statement of the theorem. 6) Theorem 3. Prior to the theorem, you explain that the better algorithm does NOT need to know s^2. But in Theorem 3, you assume that the algorithm receives it as an input. Please fix. 7) Lemma 6. Please don't use words such "as above" in statements of theorems/lemmas. Statements should be readable without any context. 8) Lemma 6. Replace "for all I \subset [n]" with "for all intervals I \subset [n]". The lemma as stated is clearly false, since there are 2^n subsets. 9) Please stress early in the introduction, that this is fixed design problem. People also study generalization errors (i.e. where there is a distribution over x's and x's sampled i.i.d. from it). 10) With O(n log n) algorithm I would suggest doing experiments on 10^7 points or more points. Preferably 10^9 points, on a cluster. 10^4 points is a tiny. (Dense linear regression is done routinely on 10^10 points.) ===== Review #3 ===== Summary of the paper (Summarize the main claims/contributions of the paper.): Suppose there is an unknown piecewise linear function f with k pieces. (One can generalize to piecewise polynomial of degree d for small d, but let me ignore this here for simplicity.) One gets n pairs of the form (x,y), where y = f(x) + eps, where eps is a (sub)gaussian noise random variable (independent for the pairs). The task is to output a hypothesis f' with good mean-squared error. Ideally, the hypothesis would itself be k-piecewise linear, but the authors allow themselves O(k)-piecewise linear hypotheses. This problem can solved optimally (with a k-piecewise linear hypothesis) in time O(kn^2) using dynamic programming. The resulting error will be roughly k/n. The authors' contribution is to give an (essentially) linear-time algorithm for the problem; however, the resulting error will be roughly sqrt(k/n). As mentioned, the authors also allow themselves to have O(k) pieces; in particular, with a little slowdown they can get to 2k+1 pieces. To compare: Suppose one desires error eps. Then the DP algorithm needs n ~ k/eps, and therefore takes time ~k^3/eps^2. By contrast, the authors' method requires ~k/eps^2 samples, but then only takes time ~k/eps^2. The algorithm used is pretty sophisticated, and is based on recent work of (I'm guessing) a similar set of authors from PODS '15. Clarity - Justification: The paper is well-written. Significance - Justification: I think the paper is pretty significant; it's a natural learning problem, and the techniques are far from trivial. It's honestly hard for me to guess whether in practice whether (k/eps samples, k^3/eps^2 time) is better or worse than (k/eps^2 samples, k/eps^2) time. At some points the authors suggest that one should think of k as largeish (so that their run-time is better) but at other times suggest it is not so large (e.g., in their algorithm for reducing the number of parts in the hypothesis from O(k) to 2k+1). 
In their experiments, k is typically on the order of 5, so one is looking theoretically at only a 25x speedup. One could imagine an eps like 10^{-2}, in which case the amount of data needed for their algorithm is 100x larger. In any case, I found the theoretical component of the paper compelling; perhaps in practice it is a good choice if data is very plentiful.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
.

=====
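For reference, a minimal sketch of the exact O(k * n^2) dynamic-programming baseline that Reviews #2 and #3 compare against, specialized to the 1-D piecewise-linear case. This is illustrative only: the function and variable names are invented here and this is not code from the paper.

import numpy as np

def segment_error(x, y, lo, hi):
    # Least-squares error of a single line y ~ a*x + b fit to points lo..hi (inclusive).
    xs, ys = x[lo:hi + 1], y[lo:hi + 1]
    A = np.column_stack([xs, np.ones_like(xs)])
    coef, _, _, _ = np.linalg.lstsq(A, ys, rcond=None)
    resid = ys - A @ coef
    return float(resid @ resid)

def segmented_regression_dp(x, y, k):
    # Exact k-piece segmented least squares; x is assumed sorted.
    # dp[j][i] = best error for fitting points 0..i with j pieces (O(k * n^2) table updates).
    n = len(x)
    err = [[0.0] * n for _ in range(n)]
    for lo in range(n):
        for hi in range(lo, n):
            err[lo][hi] = segment_error(x, y, lo, hi)  # naive; running sums would be faster
    INF = float("inf")
    dp = [[INF] * n for _ in range(k + 1)]
    cut = [[0] * n for _ in range(k + 1)]
    dp[1] = list(err[0])                    # a single piece covering points 0..i
    for j in range(2, k + 1):
        for i in range(j - 1, n):
            for t in range(j - 1, i + 1):   # last piece covers points t..i
                cand = dp[j - 1][t - 1] + err[t][i]
                if cand < dp[j][i]:
                    dp[j][i], cut[j][i] = cand, t
    pieces, i = [], n - 1                   # backtrack the segment boundaries
    for j in range(k, 1, -1):
        t = cut[j][i]
        pieces.append((t, i))
        i = t - 1
    pieces.append((0, i))
    return dp[k][n - 1], pieces[::-1]

Filling the err table is the quadratic-in-n bottleneck the reviews point to; presumably the O(n^2 d^2) figure quoted in Review #1 comes from updating each interval's least-squares fit in O(d^2) time with running sums, while the paper's near-linear algorithm avoids building this table altogether.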