We thank the reviewers for their careful reading and valuable comments. We believe the constructive feedback will improve the paper and increase its potential impact on the community.

Assigned_Reviewer_1

C1: When defining g, it would be nice to explain that the jumps of g are defined only at the intensities for each event.
R1: We will clarify this in the final version.

C2: What property of the triggering kernel does the method need, and why? What if we have a non-monotonic triggering kernel?
R2: We require the triggering kernel to be monotonically decreasing, such as the exponential kernel exp(-t)I[t > 0], the Gaussian kernel exp(-t^2)I[t > 0], and the heavy-tailed log-logistic kernel 1/(1+t)^2 I[t > 0]. Monotonicity is needed because it ensures that the mapping from z(t) to t (restricted to an interval) is one-to-one (Figure 1(B)). Moreover, in this case each pre-image of I_j is a continuous interval and the pre-images of different I_j are disjoint, which makes it easier to compute the coefficients. (A minimal kernel sketch is given at the end of this response.)

C3: Why doesn't COMPUTE-COEFFICIENT need to know the weights w?
R3: From the definitions of w and x_t in line 220, we have w \cdot x_t = \lambda_0 + \alpha \sum_{t_i < t} \kappa(t - t_i). Hence, to simplify line 319, we can cancel \lambda_0 and \alpha on both sides, which leads to x_{t'_j} = x_{t_j} in line 323. Therefore we do not need to know w. We will make this clearer in the final version.

C4: Can the nonlinearity have an arbitrary number of levels?
R4: One can show that the number of levels cannot exceed the number of data points, as in the Single Index Model and isotonic regression (see the isotonic regression sketch at the end of this response). One can include additional regularization, such as an upper bound on the Lipschitz constant, to further reduce the number of levels.

C5: Should explore the case with inhibitory interactions (negative alpha) and mention the constraint g(wx) > 0 in the description.
R5: Our method works for this case (see the update rule for w in line 366). We will add this synthetic experiment and mention the constraint g(wx) > 0 in the description.

C6: It would be nice to compare with the likelihood-based method for the Hawkes process with an exponential nonlinearity.
R6: We will add this experiment in the final version.

Assigned_Reviewer_3

We thank the reviewer for the comprehensive and encouraging review.

Assigned_Reviewer_5

C1: Can the Isotron algorithm in SIM be applied to some of the experiments?
R1: The traditional Isotron algorithm cannot be directly applied to our applications, since it is designed for regression. In contrast, we focus on learning a temporal point process where the response variable is not directly observed and the observed data are not i.i.d., a setting significantly different from SIM.

C2: In line 214, the triggering kernel is not explicitly defined. Is there any required property of the triggering kernel in line 284?
R2: Many kernels can be used, such as the exponential kernel exp(-t)I[t > 0], the Gaussian kernel exp(-t^2)I[t > 0], and the heavy-tailed log-logistic kernel 1/(1+t)^2 I[t > 0]. In fact, any monotonically decreasing kernel can be used. Monotonicity is needed because it ensures that the mapping from z(t) to t (restricted to an interval) is one-to-one (Figure 1(B)).

C3: A table summarizing all the listed symbols would be helpful.
R3: We will add it in the final version.

C4: At line 334, it is not clear how the computation can be done.
R4: According to the definition of x_t at line 220 and the exponentially decaying kernel \kappa(t), the equation x_{t'_j} = x_{t_j} simplifies to \sum_{t_i < t'_j} \kappa(t'_j - t_i) = \sum_{t_i < t_j} \kappa(t_j - t_i). Since t_j is given, we can find t'_j (a one-dimensional variable) by solving this equation with a standard root-finding algorithm (see the root-finding sketch at the end of this response).
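
For concreteness, below is a minimal sketch of the three triggering kernels named in R2 above, together with a numerical check of their monotone decrease on t > 0. The function names and the grid check are our own illustration, not code from the paper.

```python
import numpy as np

# Triggering kernels named above; each is supported on t > 0 and
# monotonically decreasing there, which keeps the mapping from z(t)
# to t one-to-one on each interval.
def exponential_kernel(t):
    return np.exp(-t) * (t > 0)

def gaussian_kernel(t):
    return np.exp(-t ** 2) * (t > 0)

def log_logistic_kernel(t):
    return 1.0 / (1.0 + t) ** 2 * (t > 0)

# Numerical sanity check: all three decrease on a grid of positive times.
grid = np.linspace(1e-3, 10.0, 1000)
for kernel in (exponential_kernel, gaussian_kernel, log_logistic_kernel):
    assert np.all(np.diff(kernel(grid)) <= 0), kernel.__name__
```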
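To illustrate the level-count claim in R4 (Assigned_Reviewer_1), the following sketch fits a one-dimensional isotonic regression on synthetic data and counts the distinct levels of the fit. It uses scikit-learn's IsotonicRegression as a generic stand-in, not the estimator from the paper.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
n = 50
x = np.sort(rng.uniform(0.0, 1.0, n))
y = np.log1p(5.0 * x) + 0.1 * rng.standard_normal(n)  # noisy monotone target

# Isotonic regression pools adjacent violators, so the fit is piecewise
# constant; the number of distinct levels never exceeds the number of
# data points and is usually much smaller.
g_hat = IsotonicRegression().fit_transform(x, y)
print(len(np.unique(g_hat)), "levels for", n, "data points")
```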
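Finally, a minimal sketch of the root-finding step in R4 (Assigned_Reviewer_5), assuming the exponential kernel exp(-t)I[t > 0]. The event times, the query time t_j, and the bracket endpoints are hypothetical choices for illustration.

```python
import numpy as np
from scipy.optimize import brentq

# Hypothetical event times t_i and query time t_j (illustration only).
history = np.array([0.0, 0.4, 1.1, 2.3])
t_j = 2.5

def summed_kernel(t):
    """sum_{t_i < t} kappa(t - t_i) with the exponential kernel exp(-t)I[t > 0]."""
    past = history[history < t]
    return np.sum(np.exp(-(t - past)))

# Right-hand side of x_{t'_j} = x_{t_j} after lambda_0 and alpha cancel.
target = summed_kernel(t_j)

# Between consecutive events the summed kernel is continuous and strictly
# decreasing, so a sign change brackets exactly one root per interval.
# Here we search the inter-event interval (1.1, 2.3) for a pre-image.
eps = 1e-9
t_prime = brentq(lambda t: summed_kernel(t) - target, 1.1 + eps, 2.3 - eps)
print("t'_j =", t_prime)
```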