We would first like to thank the Reviewers for their careful review and valuable comments.

Summary of our paper

This paper presents an extension of the Hawkes process in which the levels of self-excitation are modelled with a stochastic differential equation (SDE). We spell out the theoretical conditions required when one modulates the intensity with an SDE, together with their implications. We explain how and why certain SDEs are not permitted, such as the ones introduced in Archambeau et al'07 & Ruttor et al'13, since they can go negative. However, the exponentiated versions of these processes make excellent candidates for Stochastic Hawkes. We explain how fully Bayesian inference can be carried out. If the SDE is mean reverting, Stochastic Hawkes converges in distribution to a classical Hawkes process, with the levels of excitation being the long-term mean of the SDE. We recover EM as a special case.

General comments

# No experimental evidence
1) We do not have extensive experiments since our aim was to gain a theoretical understanding of generalizing Hawkes processes by extending the constant excitations to a stochastic process. We hope that this paves the way for the design of experiments and applications for the general class of stochastic processes governing the unobservable intensity lambda, be it a Gaussian Process (LGCP), a piecewise Markov process or an SDE (as in our case).

# Scalability
1) Our proposed inference algorithm has the same order of computational complexity as that of Rasmussen'13, which is O(N^2). Note, however, that in practice our algorithm is faster than Rasmussen'13, since we can apply several optimisations which reduce the constant factor by a factor of 2. This is possible due to the likelihood function stated in Proposition 1. Intuitively, for every event time, we are able to determine whether it is an immigrant or an offspring.
If it is an immigrant, all terms involving offspring vanish; if it is an offspring of event j, then the terms associated with immigrants and with offspring of events other than j vanish. Formally, sampling with the branching structure Z leads to simpler computation.
2) Compared to Rasmussen'13, who employed MH algorithms, we managed to derive Gibbs samplers for most of the parameters of interest thanks to Proposition 1. These are more efficient because they waste fewer samples and converge faster.

# Potential applications
We trust that there will be many applications for Stochastic Hawkes where events and intensities accelerate each other with correlated levels of contagion. Example: a big earthquake often creates many small aftershocks. The intensity generated by a big earthquake will tend to be high, while the smaller aftershocks add little intensity. Stochastic Hawkes can capture these negative correlations. Such correlated levels of contagion can be modelled neither with constant excitations (Hawkes'71) nor with iid excitations (Rasmussen'13).

To Reviewer 1
1) Base intensity: This is indeed a typo and we thank the Reviewer for pointing it out. We had forgotten to define $Y_0:=\lambda_0 - a$.
2) We agree that the term Z_i=Z_ij is confusing and have removed it. In essence, Z_ij is defined as follows: Z_i0=1 if event i is an immigrant, and Z_ij=1 if event i is an offspring of event j.
3) Other samplers: We agree with the Reviewer that there are other samplers which are also O(n). We were aware of these but were unsure how to incorporate SDEs into the sampling algorithms proposed in (Valera, ICDM'15; Farajtabar, NIPS'15). However, we were comfortable extending Dassios and Zhao's algorithm to include SDEs. We certainly intend to investigate both recommended methods, which might turn out to be more efficient than ours.
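As an illustration of the branching indicators in point 2), the following is a minimal sketch of how one parent label per event could be drawn. It is not the sampler of the paper: it assumes a constant base rate `mu`, an exponential kernel with decay `delta`, and known excitation levels `levels[j]`, and the function name `sample_branching` is hypothetical.

```python
import math
import random


def sample_branching(times, mu, levels, delta, rng):
    """Draw one parent label per event.

    Label 0 means the event is an immigrant (Z_i0 = 1); label j >= 1 means it
    is an offspring of the j-th event (Z_ij = 1), counting events from 1.
    Assumptions (not from the paper): constant base rate `mu` and kernel
    levels[j] * exp(-delta * (t - times[j])).
    """
    parents = []
    for i, t_i in enumerate(times):
        # Weight of the immigrant explanation: the assumed constant base rate.
        weights = [mu]
        # Weight of each earlier event: its excitation level times the decayed kernel.
        for j in range(i):
            weights.append(levels[j] * math.exp(-delta * (t_i - times[j])))
        # rng.choices normalises the weights internally.
        parents.append(rng.choices(range(i + 1), weights=weights)[0])
    return parents


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_branching([0.5, 1.0, 1.2, 3.0], 0.3, [1.0, 0.2, 0.8, 0.5], 1.5, rng))
```

Once a label is fixed for event i, only one term (the base rate or a single kernel contribution) remains relevant for that event, which is the simplification described under Scalability.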
To Reviewer 2
1) Base intensity as a function of time t: The Reviewer is correct in saying that classical Hawkes processes admit a constant base intensity. We make it a deterministic function of t, and this is valid on two accounts: (a-Definition) If we set Y=0, we retrieve the inhomogeneous Poisson process with rate as a function of time. (b-Intuition) This allows one to consider the impact of setting an initial condition \lambda(0):=\lambda_0, so that the learner can model the process from some time after it has started. A similar formulation can be found in recent work by Mohler-Lewis-Brantingham-Bertozzi'10, where the authors modelled the number of civilian deaths in Iraq using a non-constant intensity.
2) We certainly intend to extend this theory to multivariate Stochastic Hawkes in the near future.
3) Hansen'15: We thank the Reviewer for pointing out this important work that we left out. We have now included it.

To Reviewer 3
1) $\wedge$: This symbol denotes the minimum operator. In essence, it invokes the distribution of the minimum: $1-P(\min(X_1,X_2)\le b) = P(\min(X_1,X_2)>b) = P(X_1 \wedge X_2> b) = P(X_1>b)P(X_2>b)$ for independent $X_1$, $X_2$ and any value b.
2) Complexity: Yes, Dassios/Zhao's algorithm is O(n).
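
The identity for the minimum in point 1) can be checked numerically. The sketch below is only an illustration under assumptions of our choosing: it takes X_1 and X_2 to be independent exponential variables with rates `rate1` and `rate2`, and the helper name `min_survival_check` is hypothetical.

```python
import math
import random


def min_survival_check(rate1, rate2, b, n, seed=0):
    """Compare a Monte Carlo estimate of P(X_1 ^ X_2 > b) with P(X_1 > b) P(X_2 > b).

    Assumption: X_1 and X_2 are independent exponentials, so that
    P(X_k > b) = exp(-rate_k * b).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x1 = rng.expovariate(rate1)
        x2 = rng.expovariate(rate2)
        # The minimum exceeds b exactly when both variables exceed b.
        if min(x1, x2) > b:
            hits += 1
    empirical = hits / n
    product = math.exp(-rate1 * b) * math.exp(-rate2 * b)
    return empirical, product


if __name__ == "__main__":
    empirical, product = min_survival_check(1.0, 2.0, 0.3, 200_000)
    print(empirical, product)
```

The two returned values should agree up to Monte Carlo error. Independence is what allows the joint probability to factorise into the product.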