We thank the reviewers for their positive feedback.

Reviewer_6: Regarding the comparison between (k/eps samples, k^3/eps^2 time) and (k/eps^2 samples, k/eps^2 time): as the reviewer points out, neither the dynamic program nor our new algorithm dominates the other. But with the increasing size of modern data sets, it is important to study the trade-off between computational and statistical efficiency, and our new algorithm offers an attractive point on this trade-off curve. In particular, for a fixed statistical error eps, our algorithm improves the running time from k^3/eps^2 to k/eps^2, i.e., by a factor of k^2. Even for small values of k, a k^2 speed-up can be important. Moreover, our experiments indicate that the empirical performance of our algorithm is significantly better than our analysis suggests: for a given empirical MSE, our algorithm is roughly 100x faster than the dynamic program. A minor clarification: piecewise polynomial regression is a special case of the piecewise linear model we study in our paper, not a generalization (see line 191).

Reviewer_7: As the reviewer points out, our analysis is for the fixed design case, which is a standard setting for studying linear regression. We agree that an extension to random design is an interesting question for future work. Regarding the novelty of our proof techniques: our main technical contributions are not the probabilistic proof techniques for linear regression, but the merging algorithm and its extension to linear regression. This extension builds on classical results for fixed design linear regression and uses them in a novel way in our two merging algorithms.

Detailed comments: 1, 2, 3, 5b, 7, 8) We thank the reviewer for these comments, which we will address in the final paper. 4) The variable r is the rank of the design matrix X (see line 183). As the reviewer points out, r is always upper bounded by d. 5a) We used s^2 and sigma^2 in order to state our results in full generality. We will clarify the difference between the two quantities in the final paper. 6) The algorithm of Theorem 3 indeed does *not* require knowledge of s^2. 9) Our paper states multiple times that we study the fixed design case (see the abstract, the introduction, and the preliminaries). 10) We agree that a dataset of size n=10^4 is usually not challenging for a nearly-linear time algorithm. We restricted the plots in Figure 1 to n=10^4 only because the dynamic program has a quadratic running time and becomes impractically slow for larger n. In Figure 2, the dashed lines correspond to running our algorithm with larger values of n; in particular, we have run our algorithm for n up to 10^6. We will clarify this in the final paper. We agree that experiments for larger values of n would also be interesting, but the current results already demonstrate clearly that our algorithm achieves a much better trade-off between computational and statistical efficiency.

Reviewer_8: We agree that the stated running time of our algorithm is nearly-linear in n and quadratic in d. However, it is important to note that *any* sufficiently accurate algorithm for linear least squares can be substituted into our algorithm in order to improve the dependence on d. We did not discuss this in the 8-page paper due to space constraints, but we mention this point in Section 2.5 of the supplementary material; we will move this paragraph into the final paper. For instance, using the algorithm of [CW13] improves the joint dependence on n and d in our algorithm. For d = O(n^1/3), this leads to a running time that is nearly-linear in the input size.
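To make the arithmetic behind this claim explicit, the following back-of-the-envelope calculation may help. It assumes a [CW13]-style least-squares solver with per-call cost O(nnz(X) + d^3 polylog(d)); the exact polynomial and logarithmic factors here are illustrative assumptions, not statements from our paper.

```latex
% Hedged sketch: substituting a sketching-based least-squares solver,
% assumed to cost O(nnz(X) + d^3 polylog(d)) per call.
\[
  T_{\text{solve}}
    = O\big(\operatorname{nnz}(X) + d^{3}\operatorname{polylog} d\big)
    \le O\big(nd + d^{3}\operatorname{polylog} d\big)
  \quad \text{since } \operatorname{nnz}(X) \le nd .
\]
\[
  d = O(n^{1/3})
  \;\Longrightarrow\; d^{3} = O(n)
  \;\Longrightarrow\; T_{\text{solve}} = \widetilde{O}(nd),
\]
% which is nearly linear in the input size, since a dense design
% matrix X has nd entries.
```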
There are many applications of linear regression where n is significantly larger than d, e.g., piecewise polynomial regression with moderate degree. Furthermore, “standard” one-piece linear regression is a special case of our algorithm, so our algorithm has the same dependence on d (up to logarithmic factors) as solving a single linear regression problem; the sketch at the end of this response illustrates the polynomial-features reduction. Regarding the novelty claim, please see the corresponding paragraph in our response to Reviewer_7.
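To illustrate the reduction mentioned above (it also underlies the clarification for Reviewer_6), here is a minimal sketch showing that polynomial regression is ordinary linear least squares after a monomial feature lift. The code is a generic illustration with names of our own choosing, not the algorithm from our paper.

```python
import numpy as np

def polynomial_features(x, degree):
    """Lift 1-d inputs x to the monomial basis (1, x, ..., x^degree).

    A degree-`degree` polynomial fit on x is then an ordinary linear
    least-squares fit on these degree + 1 features. This is why
    piecewise polynomial regression is a special case of the piecewise
    linear model, and one-piece linear regression is the k = 1 case.
    """
    return np.vander(x, N=degree + 1, increasing=True)

# Example: fit a single cubic piece by plain linear least squares.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 200)
y = 2.0 - x + 0.5 * x**3 + 0.1 * rng.standard_normal(x.size)

X = polynomial_features(x, degree=3)           # fixed design matrix
coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # ordinary least squares
print(np.round(coef, 2))                       # approx [ 2.  -1.   0.   0.5]
```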