Paper ID: 707
Title: DR-ABC: Approximate Bayesian Computation with Kernel-Based Distribution Regression

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
In this paper, the authors use kernel-based distribution regression to construct sufficient statistics that improve on the performance of K2-ABC, and they provide an accelerated computation procedure based on random features. Empirically, the algorithm outperforms other kernel-based ABC algorithms, i.e., K2-ABC and SA-ABC.

Clarity - Justification:
The main part of the paper is well organized. However, to make the paper self-contained and to better situate this work, K2-ABC should be introduced in detail.

Significance - Justification:
Rather than simply using the embedding of the empirical distribution as the sufficient statistic, as in K2-ABC, this paper introduces two kernel-based distribution regression methods to construct sufficient statistics in order to retrieve more information from the data. The experiments demonstrate the benefits of the proposed statistics.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
In this paper, the authors provide two ways to construct sufficient statistics for approximate Bayesian computation (ABC) based on kernel-based distribution regression. More specifically, one statistic is based on regression from kernel embeddings of full distributions, and the other is based on regression from kernel embeddings of conditional distributions. To speed up the procedure, the authors also provide an accelerated algorithm that uses random features. Plugging these statistics into K2-ABC improves its performance empirically.

The paper is generally well organized and easy to follow. However, K2-ABC, the most closely related algorithm, is not explicitly introduced, so the paper is not self-contained and the position of the proposed work is not clear. There are also several issues to be clarified.

1) The motivation for and benefits of using distribution regression are not clear to me. From my understanding, as long as the kernel used in the kernel embedding is characteristic, the obtained embedding is already a sufficient statistic (see the illustrative sketch at the end of this review). Despite the empirical performance, the reason the authors propose these computationally costly statistic-construction methods is not clear, and their theoretical properties are not established.

2) The authors provide two ways to construct the sufficient statistics. What is the criterion for selecting between them in real applications? Moreover, how should one decide how to partition the variables to form the conditional distribution regression? These points should be discussed in the paper.

3) The authors claim that the number of random features can be further reduced with the algorithm of Bach (2015). However, that algorithm requires constructing a non-uniform sampling distribution based on leverage scores, and thus may not be directly applicable.

In sum, the paper provides an interesting way to improve the existing K2-ABC. However, there are several issues that need to be addressed.
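To make the contrast raised in point 1) concrete, the following is a minimal sketch (not taken from the paper) of the embedding-based discrepancy that K2-ABC relies on: with a characteristic kernel, the empirical mean embedding of a dataset determines its distribution, so observed and simulated data can be compared directly through the resulting MMD. The Gaussian kernel, the bandwidth argument, and the soft-weighting comment are illustrative assumptions, not details drawn from the paper.

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth):
    """Gaussian kernel matrix k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd2(X, Y, bandwidth):
    """Biased estimate of the squared MMD between the empirical mean
    embeddings of samples X and Y; with a characteristic kernel this is
    a proper discrepancy between the underlying distributions."""
    return (gaussian_kernel(X, X, bandwidth).mean()
            + gaussian_kernel(Y, Y, bandwidth).mean()
            - 2.0 * gaussian_kernel(X, Y, bandwidth).mean())

# Illustrative K2-ABC-style soft acceptance weight for one simulated dataset:
# weight = np.exp(-mmd2(observed, simulated, bandwidth) / epsilon)
```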
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper presents a method for automatically generating optimal statistics for likelihood-free models or ABC. The algorithm first maps the empirical or raw observations/pseudo-observations into an RKHS (the mean embedding). The mean embeddings are then compared using another kernel (a dissimilarity between probability distributions). Finally, this dissimilarity kernel is used to perform ridge regression mapping (indirectly) observations to parameters. The whole procedure thus avoids explicit summary-statistic functions (see the illustrative sketch after this review). There are several flavours of the algorithm: conditional DR, where only some "aspects" of the observations are modeled in the DR (this was not fully understood by the reviewer), and random Fourier features, to reduce the dimensionality of the observations without loss of information.

Clarity - Justification:
Overall the writing quality is high. The intro and related work are very good, but I would prefer to have them shortened so that more intuition and explanation could be given to the DR sections. As mentioned below, I felt the motivation behind the conditional DR was lacking; some more examples would help. There is a lot of notation, and readers could use the help of an explanatory figure or two.

Significance - Justification:
I like the idea of this paper a lot. It is an interesting and original way of doing away with summary statistics in ABC (one of the long-standing issues in the field). I wonder about the significance of the paper due to its reliance (is it a reliance?) on random Fourier features to reduce the computational complexity of the MMD. If these were not used, would the authors still consider this a practical algorithm?

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I did not fully understand the significance of the conditional DR. The authors may wish to motivate and explain it better in the paper. For instance, I can imagine that data generated away from the posterior could have a negative effect on learning the DR, but I do not think this is what is happening. Conditional DR seems very important for performance, yet I do not see an explanation of how the split is made in Algorithm 2 in any of the examples.

Can the authors include the values for L and M in their experiments and figures? I prefer seeing L=100 or M=10000 so I can follow which parts of the algorithm the authors are referring to.

What happens to DR-ABC when the prior is very broad relative to the posterior? Would the quality of the DR suffer? I would expect the precision of the DR to decrease around the posterior, since there would be so few training points there. What is the intuition for the DR here, and for kernel regression with respect to interpolating between training points?

I definitely agree that the choice of summary statistics is an important issue in ABC. I wonder how scientists using simulations feel about this. Presumably the statistics they use encode their intuition about what is important in a simulation. They can also look at coarse- to fine-grained statistics to understand their simulator/model. The approach in the paper treats the statistics, i.e. the important intuitive features of the data, as a black box (for the scientist). Is it possible to open the box and understand what the implied statistics are? This could help sell a method like this by demonstrating that meaningful "statistics" are captured.

What happens to posterior predictive checks? These are very important for analysing inference results; do we add new statistics to test in this situation, to make up for the fact that the statistics are no longer in the algorithm?
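As a companion to the algorithm described in the summary of this review, here is a minimal sketch of the regression step under simplifying assumptions: the kernel mean embedding is approximated with random Fourier features for a Gaussian kernel, and a plain ridge regression maps the embedded simulated datasets to the parameters that generated them. Function names, the regularisation constant, and the use of a primal (feature-space) regression rather than the paper's exact kernel ridge formulation are all illustrative assumptions.

```python
import numpy as np

def rff_mean_embedding(X, omegas, biases):
    """Approximate kernel mean embedding of dataset X via random Fourier
    features: average of sqrt(2/D) * cos(omega . x + b) over the sample."""
    features = np.sqrt(2.0 / len(omegas)) * np.cos(X @ omegas.T + biases)
    return features.mean(axis=0)

def fit_distribution_regressor(datasets, thetas, bandwidth,
                               n_features=100, reg=1e-3, seed=0):
    """Ridge regression from RFF mean embeddings of simulated datasets to
    the parameters that generated them; the fitted map then acts as a
    learned summary statistic for ABC."""
    rng = np.random.default_rng(seed)
    d = datasets[0].shape[1]
    omegas = rng.normal(scale=1.0 / bandwidth, size=(n_features, d))
    biases = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    Phi = np.stack([rff_mean_embedding(X, omegas, biases) for X in datasets])
    Theta = np.asarray(thetas)
    W = np.linalg.solve(Phi.T @ Phi + reg * np.eye(n_features), Phi.T @ Theta)
    return lambda X: rff_mean_embedding(X, omegas, biases) @ W  # dataset -> summary
```

Applied to a new (pseudo-)dataset, the returned map plays the role of the automatically constructed statistic that is then plugged into the ABC discrepancy.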
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors propose a new method for deriving summary statistics to be used in ABC algorithms. These statistics are obtained using distances based on mean embeddings defined on reproducing kernel Hilbert spaces.

Clarity - Justification:
Very nice paper, with many things inside, sometimes too many. Sometimes difficult to follow, but certainly above average.

Significance - Justification:
The authors propose a brand new method which combines several different strategies already present in the literature.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The authors provide a nice contribution to the field of ABC. As a single criticism, I would like to see more discussion of the choice of the kernels, which apparently play a key role in the method.

=====