We thank the reviewers for the careful reading of the manuscript and for their insightful and constructive comments. The missing details, requested clarifications and additional references will be included in the final version as instructed.

Kernel choice [Rev1]: The choice of kernel family affects performance and requires some knowledge of the characteristics of the data; this is typical of kernel methods. A large body of literature already addresses this question (for guidelines, cf. Ch. 4 of Rasmussen & Williams), so we refrained from an extensive discussion, but we will add a clarification. For our method, an important criterion is the characteristic property of kernels, on which we briefly comment. Once an appropriate kernel family has been selected, hyperparameters such as the kernel bandwidth can be chosen via cross-validation, a common practice in supervised methods.

Relationship to K2-ABC [Rev2]: We will give a more thorough overview of K2-ABC and make this relationship explicit in the final version. Namely, while K2-ABC uses smoothing on the space of embeddings of distributions (akin to a Nadaraya-Watson estimator), our approach learns the summary statistics that are optimal with respect to a given loss function; in the case of the squared loss, this corresponds to kernel ridge regression from the space of embeddings of distributions to the parameter space (see the first sketch below).

Benefits of distribution regression (DR) [Rev2]: Following SA-ABC, we use regression to the parameter space in order to estimate the optimal summary statistics. However, instead of regressing from a concatenation of the data itself or of some transformation of it (as in SA-ABC), which is often a difficult high-dimensional regression problem and/or relies on heuristic transformations, we encode the distributional assumptions on the data (whether the data are assumed to be iid given the parameter, or whether the parameters correspond to a certain conditional distribution in the model) by using embeddings or conditional embeddings as inputs to the supervised learning. This allows us to estimate the optimal summary statistics more accurately.

Random Features [Rev2]: We agree that the method of Bach'15 requires sampling from an importance distribution which is in general intractable. The reference simply points to an active research area showing that random features can in some cases provably reduce computational cost without sacrificing statistical efficiency (random Fourier features are also used in the sketches below). After the submission of the manuscript, we became aware of further advances in this field (Rudi et al., Generalization Properties of Learning with Random Features). We will clarify this point in the final version.

DR-ABC vs conditional DR-ABC (CDR-ABC) [Rev2&Rev3]: Which method to use depends on the modelling assumptions; we will give further clarifications in the final version. CDR-ABC is preferable when inferring parameters that naturally model a certain conditional distribution of one set of observed variables given another, e.g. a transition operator in time series data. This is the case in the blowfly model, where the parameters model the transition from N(t-tau),...,N(t) to N(t+1), which is the conditional distribution used for CDR-ABC on this data. Another example for CDR-ABC would be measurements made at known spatial locations, where the model parametrises the conditional distribution given those locations. When there is no such natural split of the data, DR-ABC can be used, as it models the joint distribution of all observed variables (see the second sketch below).
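To make the regression step concrete, here is a minimal sketch of the DR-ABC pipeline under the squared loss: each simulated dataset is mapped to an approximate kernel mean embedding via random Fourier features, and a ridge regression from embeddings to parameters yields the learned summary statistic. This is an illustrative toy, not the implementation used in the paper; the toy Gaussian model, the bandwidth, the number of features and the ridge penalty are assumptions that would in practice be chosen as discussed under "Kernel choice" above.

    import numpy as np

    def rff(X, W, b):
        # Random Fourier features approximating a Gaussian kernel (Rahimi & Recht);
        # averaging the features of a sample gives an approximate mean embedding.
        return np.sqrt(2.0 / W.shape[0]) * np.cos(X @ W.T + b)

    def mean_embeddings(datasets, W, b):
        # One embedding vector per simulated dataset {x_1, ..., x_n} ~ p(.|theta).
        return np.stack([rff(X, W, b).mean(axis=0) for X in datasets])

    # Toy model (illustrative only): theta is the mean of a unit-variance Gaussian.
    rng = np.random.default_rng(0)
    thetas = rng.uniform(-3, 3, size=(200, 1))                # prior draws
    datasets = [rng.normal(t, 1.0, size=(100, 1)) for t in thetas.ravel()]

    D, sigma = 300, 1.0                                       # feature count and bandwidth:
    W = rng.normal(0.0, 1.0 / sigma, size=(D, 1))             # cross-validate in practice
    b = rng.uniform(0.0, 2 * np.pi, size=D)

    Phi = mean_embeddings(datasets, W, b)                     # (200, D) regression inputs
    lam = 1e-3                                                # ridge penalty (illustrative)
    beta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ thetas)

    # The fitted map acts as the summary statistic s(y) = embedding(y) @ beta.
    y_obs = rng.normal(1.5, 1.0, size=(100, 1))
    print(mean_embeddings([y_obs], W, b) @ beta)              # should roughly recover 1.5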
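For CDR-ABC, the regression input for each simulated dataset is instead built from an estimate of a conditional embedding, e.g. of N(t+1) given the window (N(t-tau),...,N(t)). The following sketch of that estimate is again a hedged illustration rather than the paper's implementation; the AR-style toy series, the regularisation and the feature counts are assumptions.

    import numpy as np

    def rff(X, W, b):
        # Same random Fourier feature map as in the previous sketch.
        return np.sqrt(2.0 / W.shape[0]) * np.cos(X @ W.T + b)

    def cond_embedding_operator(Px, Py, lam=1e-2):
        # Empirical conditional embedding operator C_{Y|X} = C_YX (C_XX + lam I)^{-1}
        # in feature coordinates; it maps features of x to the embedding of P(Y|X=x).
        n, Dx = Px.shape
        A = Px.T @ Px / n + lam * np.eye(Dx)      # regularised C_XX
        B = Py.T @ Px / n                         # C_YX
        return np.linalg.solve(A, B.T).T          # (Dy, Dx); equals B @ inv(A)

    # Toy time series as an illustrative stand-in for one simulated dataset.
    rng = np.random.default_rng(1)
    series = rng.normal(size=500).cumsum()
    tau = 3
    X = np.stack([series[i:i + tau] for i in range(len(series) - tau)])  # lag windows
    Y = series[tau:, None]                                               # next values

    Dx, Dy = 100, 100
    Wx = rng.normal(size=(Dx, tau))
    bx = rng.uniform(0.0, 2 * np.pi, Dx)
    Wy = rng.normal(size=(Dy, 1))
    by = rng.uniform(0.0, 2 * np.pi, Dy)

    C = cond_embedding_operator(rff(X, Wx, bx), rff(Y, Wy, by))
    feature = C.ravel()   # (Dy * Dx,) regression input for this dataset

Stacking such vectors across simulated datasets and regressing the parameters on them then proceeds exactly as in the unconditional sketch above.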
Values of L and M [Rev3]: This information will be made explicit as instructed (it is currently provided in the figure captions).

Broad prior [Rev3]: Indeed, when the prior is extremely broad, the distributions of the simulated data (training inputs) may be far from the distribution of the observed data (test input), which may affect the quality of the DR fit. Any regression-based technique would face a similar problem, however, and the advantage of DR over regressing from the concatenated data, as in SA-ABC, is precisely that its inputs carry a more meaningful distance (CDR-ABC helps further in this respect, by conditioning on variables whose marginal distributions are not modelled by the parameters to be inferred).

Interpretation of statistics [Rev3]: Indeed, by constructing summary statistics via nonparametric regression, the performance improvement with respect to a given loss function comes at the expense of statistics that are difficult to interpret. While it is in principle possible to study the fitted function and how it depends on empirical moments, posterior predictive checks and model criticism would likely need to be performed in a nonparametric fashion as well, e.g. following Lloyd & Ghahramani'15, Statistical Model Criticism using Kernel Two Sample Tests (a sketch is given below). We thank the reviewer for this useful comment and will add this discussion to the final version.
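On the last point, a minimal sketch of such a nonparametric check in the spirit of Lloyd & Ghahramani'15: an unbiased estimate of the squared MMD between the observed data and posterior predictive replicates, calibrated with a permutation test. The Gaussian kernel, its bandwidth and the stand-in samples below are illustrative assumptions only.

    import numpy as np

    def mmd2_unbiased(X, Y, sigma):
        # Unbiased estimate of squared MMD between samples X and Y under a
        # Gaussian kernel; large values indicate model-data discrepancy.
        def k(A, B):
            sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
            return np.exp(-sq / (2 * sigma**2))
        Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
        n, m = len(X), len(Y)
        np.fill_diagonal(Kxx, 0.0)                # drop diagonal terms for unbiasedness
        np.fill_diagonal(Kyy, 0.0)
        return (Kxx.sum() / (n * (n - 1)) + Kyy.sum() / (m * (m - 1))
                - 2 * Kxy.mean())

    # Model criticism: compare observed data with posterior predictive replicates.
    rng = np.random.default_rng(2)
    y_obs = rng.normal(0.0, 1.0, size=(200, 1))   # stand-in for real observations
    y_rep = rng.normal(0.2, 1.0, size=(200, 1))   # stand-in for model simulations
    stat = mmd2_unbiased(y_obs, y_rep, sigma=1.0)

    # Permutation test: shuffle the pooled sample to approximate the null.
    Z = np.vstack([y_obs, y_rep])
    null = []
    for _ in range(200):
        rng.shuffle(Z)
        null.append(mmd2_unbiased(Z[:200], Z[200:], sigma=1.0))
    pval = np.mean(np.array(null) >= stat)
    print(stat, pval)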