Paper ID: 1017
Title: Isotonic Hawkes Processes

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This is an interesting paper. It presents a moment-based approach to estimating the weights and the link function of nonlinear Hawkes processes, along with theoretical guarantees on performance. However, some aspects of the learning algorithm need clarification/justification, and the empirical experiments should include a comparison to standard likelihood-based alternatives.

Clarity - Justification:
The description in Section 3.3 was particularly confusing.

Significance - Justification:
This is a nice contribution to nonlinear Hawkes process estimation, but it follows fairly directly from the close relationship between nonlinear Hawkes processes and generalized linear models, and from the previous theoretical work on isotonic regression.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
- When the nonlinearity was first defined (line 267), I was confused as to why the number of events and the number of levels of the piecewise constant function were both equal to n. It later became clear that the "jumps" of the piecewise constant function are defined only at the intensities for each event. It would be nice to make that point alongside the definition of g.
- You say (lines 308-311), "z(t) is a continuous and monotonic decreasing function of time, due to the property of triggering kernels (see Figure 1(B))." What property of triggering kernels are you referring to? Isn't this only true if the triggering kernel is monotonically decreasing? Why should that be the case? For example, what if you have a non-monotonic triggering kernel, such as one with a delay? (See the sketch following these comments.)
- Even assuming monotonically decreasing kernels, I am still confused about this section. Why doesn't COMPUTE-COEFFICIENT need to know the weights w? Line 4 of Alg. 1 just says, "find t_j' that satisfies x_{t'_j} = x_{t_j}," which isn't very clear. Digging into line 319, this seems to rely on w. If that were the case, the coefficients would need to be updated in every iteration of Alg. 2.
- Along the same lines, couldn't the nonlinearity have an arbitrary number of levels? For realistic datasets, n will be extremely large, so learning and representing the function at every event seems wasteful.
- One of the main advantages of the nonlinear Hawkes process is that it can handle inhibitory interactions (negative alphas in your notation), as long as the range of the nonlinearity is nonnegative. However, it does not look like you explored this case (all of your transfer function plots map R_+ to R_+). Also, you should mention the constraint g(wx) > 0 in the description of the learning algorithm.
- While the theoretical guarantees of this algorithm are nice, in practice likelihood-based methods perform very well for nonlinear Hawkes processes. However, they typically require a fixed nonlinearity and a discretization of time. In particular, under fairly general conditions the likelihood is log-concave. Since this is the obvious alternative, it would be nice to show improvements of your approach in either run time or parameter estimation over, say, a nonlinear Hawkes process with an exponential nonlinearity.
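To make the concern about non-monotonic kernels concrete, a minimal sketch (the exponential and delayed Gaussian kernels below are my own illustrative choices, not taken from the paper): between events, z(t) = sum_{t_i < t} kappa(t - t_i) is monotonically decreasing for a decreasing kernel, but can increase for a kernel with a delay.

    # Minimal sketch: behavior of z(t) between events under two triggering
    # kernels. Kernel forms and event times are illustrative assumptions.
    import numpy as np

    events = np.array([0.0, 1.0, 1.5])        # toy event times
    grid = np.linspace(1.6, 4.0, 200)         # times after the last event

    def z(t, kernel):
        """Summed triggering effect of all past events at time t."""
        lags = t - events[events < t]
        return kernel(lags).sum()

    exp_kernel = lambda s: np.exp(-2.0 * s)                   # monotone decreasing
    delayed_kernel = lambda s: np.exp(-0.5 * (s - 1.0) ** 2)  # peaks at lag 1 (delay)

    z_exp = np.array([z(t, exp_kernel) for t in grid])
    z_del = np.array([z(t, delayed_kernel) for t in grid])

    print("exponential kernel: z non-increasing?", bool(np.all(np.diff(z_exp) <= 0)))  # True
    print("delayed kernel:     z non-increasing?", bool(np.all(np.diff(z_del) <= 0)))  # False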
===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper proposes the Isotonic Hawkes Process to capture the non-linear influence effects of historical events. The authors propose to use a piecewise-linear monotonic function to approximate the true transfer function, thus enabling learning of both the transfer function and the parameters from data. The learned model has been shown to have nice theoretical properties and at the same time works well empirically.

Clarity - Justification:
The problem is strongly motivated. The article is well structured. The authors reviewed basic concepts of isotonic regression and Hawkes processes, which helped a lot.

Significance - Justification:
Allows learning of both the non-linear transfer function and the parameters in the Hawkes process. Beautiful result.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
- SIM is briefly mentioned in Section 2.1. Could the Isotron algorithm be applied to some of the experiments? It would be interesting for comparison, even though it does not yield a global optimum. (A sketch of the Isotron update appears after the reviews.)
- Page 2, line 214: the triggering kernel is never explicitly defined. Is there any required property of a triggering kernel? A reference would be appreciated. I think this is related to the subsequent analysis at line 284: "... decreasing due to the property of triggering kernel".
- A table summarizing all the listed symbols would be helpful.
- At line 334, it is not clear to me how the computation can be done.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents an efficient algorithm for learning the intensity kernel of a Hawkes process. The algorithm exploits the sparsity of discrete event data to compute and optimize over the integral of the intensity function. The paper provides theoretical guarantees about the convergence properties of the algorithm. It demonstrates with synthetic experiments that the method can accurately reconstruct a variety of ground-truth intensity functions. It then also demonstrates that the method improves dramatically over simple Hawkes process models and Poisson tensor factorization models in predictive performance on three real-world datasets.

Clarity - Justification:
This paper is very well written and a pleasure to read. It provides excellent coverage of prior work. The theory is clearly explained, and the paper does a good job of showing how the theory connects to the proposed algorithm.

Significance - Justification:
As the paper demonstrates on both network and dynamic recommendation data, the model has wide applicability. It dramatically outperforms a strong baseline in Poisson tensor factorization and clearly improves over the simple Hawkes process model. The theory behind the model is well established, and parts of it may transfer to other problems involving sparse discrete event data.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I think this is an excellent paper. Learning non-trivial dynamics from discrete event data is an important and open problem. The proposed model family is very flexible, and the algorithm is efficient and principled. The paper identifies many different applications of Hawkes processes---in particular, low-rank factor models and multi-dimensional (mutually excitatory) Hawkes processes---and demonstrates improvements in both on two very different datasets. The synthetic experiments are well chosen, and the results bolster the claim that the model is capable of accurately capturing different kinds of underlying dynamics.

=====
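Regarding Review #2's question about applying the Isotron algorithm: a minimal sketch of the Isotron update (Kalai & Sastry, 2009) for a single-index model y ~ u(w.x) with monotone u. The synthetic data, logistic link, and all parameter choices below are illustrative assumptions, not taken from the paper.

    # Minimal Isotron sketch: alternate an isotonic (PAV) fit of the link
    # with a perceptron-like update of the weights.
    import numpy as np

    def pav(y):
        """Pool Adjacent Violators: non-decreasing isotonic fit to y."""
        blocks = [[float(v), 1] for v in y]            # [block mean, block size]
        i = 0
        while i < len(blocks) - 1:
            if blocks[i][0] > blocks[i + 1][0]:        # violation: merge blocks
                (m0, n0), (m1, n1) = blocks[i], blocks[i + 1]
                blocks[i:i + 2] = [[(m0 * n0 + m1 * n1) / (n0 + n1), n0 + n1]]
                i = max(i - 1, 0)                      # re-check previous block
            else:
                i += 1
        return np.concatenate([[m] * n for m, n in blocks])

    def isotron(X, y, iters=50):
        w = np.zeros(X.shape[1])
        for _ in range(iters):
            order = np.argsort(X @ w)                  # sort by current projections
            u_hat = np.empty(len(y))
            u_hat[order] = pav(y[order])               # monotone fit of y vs. w.x
            w = w + (y - u_hat) @ X / len(y)           # Isotron weight update
        return w, u_hat

    # Toy usage with a synthetic monotone (logistic) link.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))
    w_true = np.array([1.0, -0.5, 2.0])
    y = 1.0 / (1.0 + np.exp(-(X @ w_true)))
    w_hat, _ = isotron(X, y)
    print(w_hat / np.linalg.norm(w_hat))               # roughly w_true / ||w_true||

Since the link u is re-fit from scratch at every iteration, no global optimum is guaranteed, which matches the reviewer's caveat.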