We thank the reviewers for their efforts and useful comments.

Reviewer 1:

1. Comment starting with “I would the authors’ contribution … would amplify the weight of their contribution” in Significance-Justification.
A. We agree that additional examples would improve our paper. Applying our idea to a given subclass of settings may require considerable effort, but we can add some specific examples of application. An interesting example is the stochastic rule with finite partitioning (SRFP) discussed by Yamanishi (1992) (cited in our references). Applying our idea with a slight modification to SRFP, we can derive new risk and PAC bounds for SRFP when the penalty is (\beta/2)\sum_{i=1}^d \{\log(n_i+n\epsilon+1)-\log\theta_i(1-\theta_i)+2\}, where d is the number of cells in the partition, n is the sample size, n_i is the number of samples in the i-th cell, and \theta_i is the parameter of the i-th cell. Furthermore, P_{\epsilon}^{(n)} \rightarrow 1 holds if the number of candidate partitions is finite. In contrast to Yamanishi’s result, this risk bound does not require quantization. Moreover, the first term of the above penalty is replaced with (d\beta/2)\log n if we use the worst-case value with respect to x^n, as in Yamanishi (1992). Thus, our idea makes the model description length much shorter, which improves his PAC bound; both penalties are displayed below for comparison.
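To make the comparison explicit, we display the two penalties below. This is a sketch under our reading that all i-dependent terms sit inside the sum; \mathrm{pen}(\theta) is an ad-hoc name introduced only for this reply.

% SRFP penalty sketched above; pen(.) is an ad-hoc name used only in this reply.
% d: number of cells in the partition, n: sample size,
% n_i: number of samples in the i-th cell, \theta_i: parameter of the i-th cell.
\[
\mathrm{pen}(\theta) \;=\; \frac{\beta}{2}\sum_{i=1}^{d}\Bigl\{\log(n_i+n\epsilon+1) - \log\theta_i(1-\theta_i) + 2\Bigr\}.
\]
% Bounding the first term by its worst case over x^n, as in Yamanishi (1992),
% replaces (\beta/2)\sum_{i=1}^{d}\log(n_i+n\epsilon+1) with
\[
\frac{d\beta}{2}\log n,
\]
% which is larger in general, so the data-dependent penalty yields a shorter model description length.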
2. Comment starting from the third-to-last line in Significance-Justification.
A. There is no reason that our idea is applicable only to sub-Gaussian cases. To apply our idea to other cases, two things must be proved: (i) P_{\epsilon}^{(n)} \rightarrow 1 exponentially fast, and (ii) the loss variation term can be bounded above by the weighted \ell_1 norm. Condition (i) is attained at least by exponential families via a Sanov-type theorem. So it suffices to show, for example, that (ii) holds for exponential families. We think this is possible for at least some exponential families, but it also requires considerable effort.

3. Comment in the paragraph starting with “The authors mention that…” in Detailed Comments.
A. Thank you for pointing out this part. Our writing there was indeed unclear. What we want to say is the following two facts: (i) In general, the risk (expected loss) of linear regression with random design is difficult to evaluate unless the range of x, the parameter space, or the loss function is bounded. Indeed, Wainwright’s evaluation is not of a risk. One of our contributions is to handle this issue by means of the typical set and the conditional expectation on it. (ii) Apart from this, most previous results impose asymptotic assumptions or other technical assumptions like (15) and (16) in Wainwright’s result. Such assumptions are usually hard to verify in practice. In contrast, our main result requires no such assumptions except the Gaussianity of x^n. So we think this direction of study is also worth pursuing in addition to the past works.

4. “The proof of Theorem 6 really should appear in the supplementary material.”
A. The reviewer is right. We will submit it if possible.

5. “Is it possible to interpret the main result the case when a sparse solution, or approximately sparse solution, is optimal?”
A. We reply by interpreting “a sparse solution is optimal” as “the true parameter is recovered exactly by the \ell_0 norm (or the \ell_1 norm under conditions like the RIP)”. Unfortunately, we did not find any exact and explicit relationship with sparsity. However, if the true parameter is sparse and the lasso recovers it, then the lasso estimator can be expected to make its target function small to some extent, which also makes the risk bound small.

6. Comment in the paragraph starting with “Can you explain further the statement…”
A. Thank you for pointing this out. We omitted an important premise. To our knowledge, there is no way to take the minimum of (3) directly. Currently, an approach based on the technique of Chatterjee and Barron (2014b) is the only way to evaluate it by a tight upper bound, and the discussion here presupposes this technique. In their derivation, after taking the expectation with respect to \tilde{\theta}, we can see that this term is not bounded with respect to x^n. Apart from this, there is another strong motivation to make L depend on x^n: if it does not, L must bound the maximum of the left side of (3) with respect to x^n, which makes L much larger in general. This is unfavorable in view of the MDL principle.

7. Comment in the paragraph starting with “I don't understand the second-to-last”
A. The reviewer is right; we overstated. We should say only that this issue can possibly be solved if \tilde{L} can depend on x^n.

Reviewer 2:
As for the comments about the presentation, we agree. Thank you. A correction: from line 409, we said that any codelength valid penalty is risk valid in fixed design cases. This is incorrect; rather, any risk valid penalty is also codelength valid in fixed design cases, as displayed below.
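Schematically, with risk validity and codelength validity defined as in the paper, the corrected statement reads:

% Corrected implication for the fixed design case (cf. line 409 of our draft):
\[
\text{risk valid} \;\Longrightarrow\; \text{codelength valid} \qquad (\text{fixed design}),
\]
% not the reverse direction that our draft asserted.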