Paper ID: 1093
Title: Learning Population-Level Diffusions with Generative RNNs

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper approaches the problem of modeling a stochastic population-level diffusion process given cross-sectional (as opposed to longitudinal) data. The author(s) demonstrate that only a few such samples are needed to provide formal guarantees on the recoverability of the potential function of such a process, and that the method can be applied to multi-dimensional data. They perform this analysis in the context of discovering the epigenetic landscape (the potential function) of a population of differentiating embryonic stem cells. They further show that the Wasserstein distance is the intuitive counterpart to L2 distance in the context of population-level diffusion modeling, and that the constraints associated with their guarantees can be satisfied using an entropic regularizer. The resulting learning problem is interpretable as an RNN, and they demonstrate its effectiveness in discovering the epigenetic landscape of both synthetic and real embryonic stem cell data.

Clarity - Justification:
The paper is well-written and all proofs and derivations are rigorous, clear and easy to follow.

Significance - Justification:
Longitudinal data, especially in the context of healthcare, is difficult to come by due to the cost and effort of its collection. As a result, methodology for the analysis of cross-sectional population data is very valuable if formal guarantees on its effectiveness can be given. Therefore, the contribution of this submission is significant.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The problem that this submission proposes to address is an important one, especially in domains such as healthcare where longitudinal data may be hard to come by.
The paper is very well-written, and even in areas where the reader may lack some background, the organization and logic of the writing make it easy to understand the conclusions. The model proposed is well-motivated, and the authors provide plenty of context for the contributions of their guarantees with regard to fewer-sample cross-sectional data. There are a few minor points to consider:
1. Some small edits for English may be appropriate; for example, there are a couple of 'nonsense sentences' between lines 804-816 on the last page of the text, in the first column.
2. A couple of points could benefit from additional/clearer explanation:
a. In the context of their real data performance test, the authors do not explain why they choose to measure accuracy in predicting day 4 embryonic stem cell data based on day 0 and day 7 data (as opposed to measuring day 2 data, which they note is also available). If there is a good reason why they chose one over the other, then they should note this; if not, I would expect that predicting both would not be significantly more difficult.
b. The first test for learning high-dimensional flows (fig 3a) demonstrates that (as they predict) parametric models perform quite well at recovering a simple potential function. They point out that as the dimensionality increases, the RNN is competitive with the linear model; however, they do not really discuss the significance of the difference between the RNN's performance and the linear model's, and we are only given 3 sample points from which to visually judge whether or not we believe that this is true.
c. In their discussion, the authors conclude that they have demonstrated that their model performs well on multiple synthetic datasets; however, it appears that the data they used from Klein et al., 2015 is not synthetic. These results should be mentioned (or the language should be clarified) in the discussion.
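Regarding point 2a, the kind of held-out comparison in question can be sketched as follows. This is only an illustration under stated assumptions: it computes the empirical 1-Wasserstein distance between two equal-size one-dimensional samples, the function name `wasserstein_1d` and the Gaussian stand-in data are hypothetical, and the exact error metric used in the paper may differ.

```python
import numpy as np

def wasserstein_1d(a, b):
    """Empirical 1-Wasserstein distance between two equal-size 1-D samples.

    In one dimension the optimal coupling matches sorted values, so the
    distance is the mean absolute difference between order statistics.
    """
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    if a.shape != b.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    return float(np.mean(np.abs(a - b)))

# Hypothetical stand-ins for an observed held-out population and two predictions.
rng = np.random.default_rng(0)
observed = rng.normal(0.0, 1.0, size=2000)
predicted_good = rng.normal(0.0, 1.0, size=2000)     # same distribution as observed
predicted_shifted = rng.normal(1.5, 1.0, size=2000)  # mean shifted by 1.5

err_good = wasserstein_1d(observed, predicted_good)
err_shifted = wasserstein_1d(observed, predicted_shifted)
```

The same function could be applied to any held-out time point (day 2 as well as day 4), which is why reporting both would seem to require little extra effort.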
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper describes a novel technique for modeling population dynamics. The model is based on specifying a potential function and a recurrence relationship. The method uses an RNN to capture the recurrence. An identifiability result is provided for populations sampled at non-stationary points in three different scenarios: the derivative of the density with respect to time is approximable with a finite number of samples; the integral of the density over time is approximable with a finite number of samples; the population is tracked until near equilibrium is achieved. The last scenario requires only a small number of samples. The method is applied to synthetic examples, where it performs on par with or better than the competing methods in terms of held-out data prediction. The real data results demonstrate a substantial prediction improvement.

Clarity - Justification:
The paper is well written and laid out. Some of the training details are missing, as are suggestions on how to apply the method to other datasets. The IPython notebook is a welcome addition that does help clarify some of the issues and will help with the paper's impact.

Significance - Justification:
The authors present a framework for learning population dynamics from a finite sample. With the increase in longitudinal datasets tracking cell populations, there is an increasing need for methods to analyze these non-linear types of data.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The authors provide a set of theorems that guarantee identifiability of the underlying dynamics. These results parallel the setting of molecular dynamics and the canonical ensemble. The key difference here is that the dynamics are learned from the observed system, rather than just simulated. The simulation is relegated to the RNN rather than a numerical integrator.
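To make the last remark concrete, here is a minimal sketch of how an unrolled recurrence can play the role of the integrator. It assumes the diffusion has the form dX = -∇Ψ(X) dt + sqrt(2σ²) dW, and it substitutes a fixed, hypothetical double-well potential for the learned network; the function names and parameter values are illustrative and are not taken from the paper or its notebook.

```python
import numpy as np

def grad_potential(x):
    """Gradient of an illustrative double-well potential Psi(x) = (x**2 - 1)**2.

    In the paper's setting this would be replaced by the gradient of the
    learned potential; the closed form here is an assumption for the sketch.
    """
    return 4.0 * x * (x**2 - 1.0)

def simulate_population(x0, n_steps, dt, sigma, rng):
    """Unroll the recurrence x <- x - dt * grad Psi(x) + sqrt(2 sigma^2 dt) * z.

    Each loop iteration corresponds to one recurrent step applied to every
    particle in the population (an Euler-Maruyama style update).
    """
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - dt * grad_potential(x) + np.sqrt(2.0 * sigma**2 * dt) * noise
    return x

rng = np.random.default_rng(0)
x0 = rng.normal(loc=0.0, scale=0.1, size=5000)  # unimodal initial population
xT = simulate_population(x0, n_steps=2000, dt=0.005, sigma=0.5, rng=rng)
frac_right = float(np.mean(xT > 0))             # share of particles in the right-hand well
```

Starting from a unimodal population at the top of the barrier, the particles settle into the two wells, which is the kind of multimodal intermediate population the learned model is expected to reproduce.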
The IPython notebook is a welcome addition and it will certainly raise the impact of the paper. Providing a more portable version of it (do we need to get the data and fix the paths?) would be helpful.
Pre-training is a little bit unclear. Please clarify what "we can pre-train the objective function on the regularizer alone" means.
In the real data experiments, shifts of the peaks (Krt8, Pou5f1) or even absence of multimodality (TagIn) can be seen. Why does this happen? Does this occur in synthetic data, and if not, how could it be replicated? Why not show results for Local in the supplemental plots?
Why are 500 latent layers required to describe the dynamics of 5-10 variables? Can you provide an intuition for what those latent variables capture? How does this scale with 1000s of genes?

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper describes a method for modelling the underlying stochastic processes governing the dynamics of populations, for example a population of stem cells differentiating over time. The paper uses an RNN to model the potential function of the diffusion process (the potential as a sum of hidden activations). The RNN model is justified by earlier links between modeling the diffusion process using drift functions equal to the negative gradient of the potential and the JKO theorem, resulting in the use of the Wasserstein metric with relative entropy regularization (a kind of variational objective function). They show recoverability of the potential under several conditions, including a realistic condition with only a few time observations. Experiments show recovery of potentials in simulated settings, comparable results to linear models for small dimensions, and significantly better results for higher dimensions. For models number of genes, they perform better than baselines.

Clarity - Justification:
The paper is pretty dense, with many ideas and concepts contained within.
Despite this, the paper is fairly easy to follow.

Significance - Justification:
In my opinion the paper is strong because it ties several theoretical concepts together with a learning algorithm (RNN) trained on real population data.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper also gives indications regarding its limitations (regarding the number of samples required to use an RNN versus a linear model for diffusion modeling). Besides lower Wasserstein error, the paper also demonstrates correctness by recovering an intermediate, multi-modal population for gene Krt8, despite only using the unimodal observations (at the previous and final time steps).
=====