** Overall comments

We thank the reviewers for their helpful comments. Analysing cross-sectional time series is an important problem across many domains, such as healthcare and the social sciences, not just single-cell RNA-seq, and we intend to expand our descriptions of the experiments and add a section to the supplement to make the broader applicability of our work clearer.

** Reviewer_1

- Training details are missing + applying to other datasets + IPython notebook.

We will include more of the training details in both the main text and the supplement, and add a section describing when the same model can be applied to more general domains. We will also include an updated IPython notebook on GitHub, with the data download documented and directory paths easily set.

- Pre-training is a little bit unclear.

We will clarify this by adding the discussion below. When $\tau \to \infty$, we assert that the individuals from the final timepoint are drawn from the stationary distribution. As discussed at line 328, in the stationary case these points should be drawn from $\exp(-\Psi(x)/\sigma^2)$, and thus we can find $\Psi$ as the solution to a log-likelihood maximization problem. This can also be seen via the loss function (Eq. 8) with $\tau \to \infty$ under a normalization constraint on $\Psi$.

- Shifts of the peaks (Krt8, Pou5f1) / absence of multimodality (Tagln) can be seen. Why does this happen? Does this occur in synthetic data, and if not, how could it be replicated?

For missed multimodality: we have constructed very sharply peaked potentials (such as t-distributions), with the initial distribution far from the minimum, where the two modes are not distinguishable from a single mode with a shoulder (much like Tagln and Lgals1 in Fig. S.2). In less adversarial examples, such as the Styblinski flow, this does not occur unless the dimensionality is substantially larger than the number of individuals. For missed shifts, switching behaviour seems to be the cause.
For example, the Pou5f1 gene rapidly and completely turns off from D0 to D8. It is hard for the method to tell whether this turning off occurs as a uniform shift, or whether there is a switch, say at D2. In theory, the shift in the tail mass of the distribution should identify which is true, but this is a very sensitive measure. Generally, it seems that genes with sharply peaked distributions cause issues both in theory and in practice. We think Thm. 3's requirement that the marginal distributions have nontrivial mass everywhere is a critical requirement even in the few-observed-timepoints case.

- Why not show results for Local?

Local displayed substantially worse, spiked distributions in both the 5d and 10d examples, which decreased the readability of the plots. We will make separate panels for the Local model and add them to the supplement before the final version.

- Why are 500 latent layers required?

Note that we use 500 hidden units, not layers. There is only one hidden layer between each timestep, and $\Delta t$ controls the number of timesteps between each observed timepoint. We discuss why we use 500 hidden units around line 626. Since each unit represents a linear energy barrier, a large number of barriers is needed to model complex potential landscapes. For example, we require at least 2*dim units even to obtain a normalizable distribution. Generating smooth and nonconvex landscapes such as those in Fig. 4 requires substantially more hidden units. The larger number of units also empirically makes optimization substantially simpler, since it is likely that at least one of the randomly initialized units is close to a true potential barrier.

- Scaling with 1000s of genes?

We require the number of individuals to be much greater than the dimensionality in order to find a reasonable nonlinear potential function.
Simulations with the Styblinski flow suggest that, with appropriate regularization, having too few cells results in predictions similar to linear and OU dynamics.

** Reviewer_4

- Only D4 is used in prediction.

We chose to highlight only D4 because it displays qualitatively interesting multimodal dynamics compared to D2. The D2 results are similar, with the linear methods performing better, and we will add the D2 equivalent of Fig. 3c for completeness.

- Fig. 3a: performance difference vs. linear, and having only 3 sample points.

For Fig. 3a, our goal was to highlight that local methods generally perform very poorly in high dimensions and that an RNN-based model is on par with parametric models. We do not necessarily expect the RNN to perform as well as the linear or OU models for simple, high-dimensional potentials. We will also include more dimension points (x-axis) in Fig. 3a to make this point clear.

- Klein et al. data referred to as synthetic.

The Klein et al. data is real data, and we will amend the text to match.
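To make the pre-training step in our reply to Reviewer 1 concrete: if the final-timepoint individuals are assumed stationary, they are draws from $\exp(-\Psi(x)/\sigma^2)$ (up to normalization), so $\Psi$ can be recovered by log-likelihood maximization. The following minimal 1-D sketch is purely illustrative and not our model's parameterization: the quadratic potential family, the grid search over a single amplitude, and all variable names are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 1.0

# Illustrative truth: Psi(x) = x**2, so the stationary density
# exp(-Psi(x)/sigma^2) is Gaussian with variance sigma^2 / 2.  These
# samples stand in for final-timepoint individuals (tau -> infinity).
samples = rng.normal(0.0, np.sqrt(sigma2 / 2.0), size=5000)

def neg_log_likelihood(a, x, sigma2, grid, dx):
    """Mean negative log-likelihood of x under Psi_a(x) = a * x**2, with
    the normalizing constant of exp(-Psi_a / sigma^2) computed by simple
    numerical quadrature on a fixed 1-D grid."""
    log_z = np.log(np.sum(np.exp(-a * grid**2 / sigma2)) * dx)
    return np.mean(a * x**2) / sigma2 + log_z

# Quadrature grid for the normalizing constant, plus a 1-D search over
# the potential amplitude (a stand-in for the full optimization problem).
grid = np.linspace(-10.0, 10.0, 4001)
dx = grid[1] - grid[0]
candidates = np.linspace(0.2, 3.0, 141)
nlls = [neg_log_likelihood(a, samples, sigma2, grid, dx) for a in candidates]
a_hat = candidates[int(np.argmin(nlls))]
print(f"recovered amplitude a_hat = {a_hat:.2f}")  # close to the true 1.0
```

The recovered amplitude sits near the true value because, for this family, the maximum-likelihood solution is $a^* = \sigma^2 / (2\,\mathbb{E}[x^2])$; the same normalization-constrained likelihood reasoning underlies the $\tau \to \infty$ limit of Eq. 8.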