We thank all three reviewers for their detailed and insightful comments. We will incorporate the recommended clarifications and address the minor comments in the revised version.

R1 ==

- L95: please give an accessible reference for details on LambdaCDM. L184: please give a reference to the "standard maximum likelihood approach"...

A: We will add the canonical reference [1] to the revised version, which discusses standard LCDM and maximum-likelihood approaches pedagogically.

[1] Dodelson, Scott. Modern Cosmology. Academic Press, 2003.

- Section 2: could you possibly apply ABC (approximate Bayesian computation; see e.g. [Marin, Pudlo et al., Approximate Bayesian computational methods, Statistics and Computing, 2012])?...

A: We agree that ABC is an interesting direction when one has a forward model that can produce summary statistics for a given set of parameters. However, this is similar to the classical approach, which uses the power spectrum as the summary statistic and relies on our cosmological knowledge to produce that spectrum for a given set of parameters. Here, we show that conv-nets, which 1) learn higher-order statistics and 2) do not rely on our knowledge of cosmology, can work very well. We therefore believe that ABC, although intriguing, is tangential to our main contribution.

- Figure 2: the power spectrum approach seems to be biased at large Omegas. Do you have intuition why? What measure of uncertainty do you have on your estimates?

A: The standard constraints on the two parameters \sigma_8 and \Omega_m are degenerate when only the power spectrum is used, with a characteristic relationship that causes the maximum-likelihood estimate to become progressively less sensitive to \Omega_m as it grows. Although we can still obtain a good estimate, especially using conv-nets, this bias begins to appear for the power spectrum method.

- Figure 4: do you have an intuition why the two approaches seem to have opposite biases?

A: Interesting observation; we cannot completely explain it. However, since the spacing between time-steps grows super-linearly in redshift z (note the logarithmic scale), the L2 loss penalizes errors more severely at higher redshifts. Because of this, both methods show small bias and low error at higher redshifts. In a more sophisticated treatment, one could substitute an appropriate Bregman divergence for the L2 norm; using z(t) as a monotonic function defining the divergence would prevent the preferential treatment imposed by the current loss function.

- L401: why these particular parameters?

A: We need to estimate the covariance matrix for a single assignment of the parameters. These particular values are the best-fit Lambda-CDM parameters for the data from the Planck satellite, which provides the state-of-the-art measurement of the cosmic microwave background. A minimal illustrative sketch of this estimation step is included below.
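For concreteness, here is a minimal sketch of the covariance estimation and Gaussian likelihood step referenced above. The array P_sims (simulated binned power spectra at the fixed fiducial parameters) and the function names are illustrative assumptions rather than our actual pipeline.

    import numpy as np

    def estimate_covariance(P_sims):
        # Sample covariance of the binned power spectrum across realizations;
        # P_sims is assumed to have shape (n_realizations, n_bins), all
        # generated at the single fiducial (Planck best-fit) parameter set.
        return np.cov(P_sims, rowvar=False)

    def gaussian_log_likelihood(P_obs, P_model, cov):
        # Gaussian log-likelihood with the covariance held fixed at the
        # fiducial parameters; P_model is the predicted spectrum for a
        # candidate (Omega_m, sigma_8) pair.
        resid = P_obs - P_model
        return -0.5 * resid @ np.linalg.solve(cov, resid)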
R2 ==

- ...is there any high-level structure here? The Fourier representation used in the analysis in 3.2 does pretty well, and this is a relatively shallow method; one could train a shallow convolutional network that will presumably learn the same types of features.

A: Although we shared the reviewer's intuition, our experimental results suggest that low-level Fourier features -- such as the kitchen-sink features used with the double-basis estimator (2BE) -- are not enough for the first problem setting, i.e. we could not train the 2BE for this task. Note that the shallow method of Section 3.2 does not rely only on training instances (to estimate the covariance matrix), but also uses cosmological principles to predict the power spectrum for a given set of parameters. While it is hard to quantify the levels of structure in our data, we performed experiments to find the proper depth for our conv-net. Reducing its depth deteriorates the results, especially since the number of convolution kernels is limited by GPU memory and we could only use a larger number of kernels after applying average pooling.

- There are only 2 kernels in the lowest layer of the network, with 3^2 parameters each. Have you looked at the learned kernels?

A: Yes. However, such minimal kernels (2 x 3^3 in this case) do not reveal meaningful patterns. In fact, with such a small number of feature maps, even "random" kernels are competitive.

- How long does each cosmology simulation take?

A: 6 CPU hours on 2 GHz processors. Note that processing time is not the only constraint, as each simulation needs more than a gigabyte of storage for a single snapshot.

- Does the CNN approach have an advantage/disadvantage compared to the other methods in terms of training data?

A: The CNN does not have any advantage over the 2BE in terms of training data. However, the prevalent technique based on the two-point correlation relies mainly on our cosmological knowledge, rather than on training data, to produce the power spectrum for a given set of parameters.

R3 ==

We are glad to hear your favourable feedback.