Thank you to all reviewers for your thoughtful comments.
R2 and R4 note that our model is general and applicable outside of international relations (IR). We present only the special case of the model for dynamic multinetworks because it facilitates our explanation of how different Bayesian network models are unified under the framework of Tucker decomposition. We agree with R4 that this unification is a contribution of the paper. We focus specifically on the IR datasets for three reasons. 1) ICEWS/GDELT are underexplored in the ML community (we are unaware of any ICML paper that uses them), and we consider the promotion of potentially impactful datasets to be a contribution. 2) Trends in IR are familiar to most people, and thus the qualitative analysis helps the reader understand the model (as R2 points out). 3) We believe, as R2 says, that "the study of IR is of interest in its own right". We agree with R4 that, with more space, exploring more network datasets would strengthen the paper.
R2 and R5 ask about the predictive experiments. Our data is sparse because a small percentage of the countries accounts for a large proportion of events. A random split therefore yields a test set of mostly zeros, which makes it hard to tell whether a model gains its advantage from overfitting sparsity or from learning interesting structure. The top row of Fig 2 shows that the tensor methods dramatically outperform the network models when predicting the dense part of the data, suggesting that their overall predictive advantage comes from learning non-trivial structure. We also see that BPTD's improvement over BPTF is more dramatic in the top row, further suggesting that Tucker's advantage over CP decomposition comes from its ability to compress more complex structure. This experimental approach follows Schein et al. (2015), and we believe that BPTD's advantage over BPTF is best demonstrated on the task for which BPTF was presented. We respectfully disagree with R5 about the lack of evaluation: Sec 7 is dedicated to these predictive experiments, which demonstrate a robust advantage of BPTD over traditional network models and BPTF. R5 requests a "paragraph of commentary on interesting insights gained." The last paragraph of Sec 8 provides this, including the geographic interpretation of the inferred communities as well as within- versus between-community interaction rates. We agree that finding geographically interpretable communities is not necessarily surprising, but the fact that it is unsurprising is what allows us to trust that the model learns meaningful structure. We also think that the granularity of the inferred geographic structure, as shown in Fig 1, is impressive.
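The sparsity argument above can be illustrated with a small, purely hypothetical calculation. The sketch below is not our experimental setup: it assumes a Zipf-like activity level per country, Poisson counts whose rate is the product of the sender's and receiver's activity, and an arbitrary scale and cutoff (`scale=50.0`, `top=20`), all chosen only for illustration. Under those assumptions it compares the expected fraction of zero counts among all country pairs (what a random split would mostly hold out) with the fraction among pairs of the most active countries.

```python
import math

def expected_zero_fraction(activity, scale, top=None):
    """Expected fraction of zero counts among ordered pairs of distinct actors.

    Assumes (hypothetically) count_ij ~ Poisson(scale * activity[i] * activity[j]),
    so P(count_ij = 0) = exp(-rate_ij). If `top` is given, only the first `top`
    actors (the most active ones) are considered.
    """
    actors = activity[:top] if top is not None else activity
    zero_probs = [
        math.exp(-scale * a * b)
        for i, a in enumerate(actors)
        for j, b in enumerate(actors)
        if i != j
    ]
    return sum(zero_probs) / len(zero_probs)

# Hypothetical Zipf-like activity: the k-th most active country has weight 1/k.
activity = [1.0 / (rank + 1) for rank in range(200)]

full = expected_zero_fraction(activity, scale=50.0)          # all pairs
dense = expected_zero_fraction(activity, scale=50.0, top=20)  # most active pairs only

print(f"expected zero fraction, all pairs:      {full:.3f}")
print(f"expected zero fraction, top-20 actors: {dense:.3f}")
```

In this toy setting a uniformly random held-out entry is very likely to be a zero, whereas the block restricted to the most active countries is much denser, which is why evaluating on the dense part separately is informative.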
R4 asks about the Bayesian nonparametric (BNP) interpretation. The priors defined after equation 8 depend on the cardinality, e.g., gamma0/C, where C is the cardinality. In practice, this parametric prior imposes increasing shrinkage on the factors as C is set larger. When C is infinite, this becomes a proper (not truncated) BNP prior. Even when truncated, though, this prior provides the practical advantage of shrinkage, which prevents the model from overfitting even when given the capacity to do so. The BNP interpretation also prompts future work in deriving adaptive truncation inference algorithms. We will implement the suggestion to consolidate mention of the BNP priors into a single section that states their significance and benefit. R4 also asks about the setting of C/K/R. These may seem small, but in the tensor case, a small increase in any one dimension substantially increases the number of latent classes. Here the number of classes is C x C x K x R = 60,000. Latent Dirichlet allocation is rarely run with more than 1000 classes and even then requires engineering tricks. We perform efficient inference on a laptop in a 60K-class allocation model. For more context, Schein et al. (2015) fit BPTF to GDELT/ICEWS using 50 classes. The shrinkage plots included in the Supplementary Material show that the model does not use all of the supplied capacity, so the cardinalities are large enough.
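To make the shrinkage effect concrete, here is a minimal sketch that draws C component weights from a gamma prior whose shape is gamma0/C. Only the gamma0/C dependence comes from the paper; the unit rate, the value gamma0 = 5, the 99% mass threshold, and the helper names `draw_weights` and `effective_count` are assumptions made for this illustration, and the sketch is not our inference code.

```python
import random

def draw_weights(gamma0, C, rate=1.0, seed=0):
    """Draw C weights i.i.d. from Gamma(shape=gamma0 / C, rate=rate).

    The rate is a hypothetical choice; random.gammavariate takes a scale
    parameter, so we pass 1 / rate.
    """
    rng = random.Random(seed)
    shape = gamma0 / C
    return [rng.gammavariate(shape, 1.0 / rate) for _ in range(C)]

def effective_count(weights, mass=0.99):
    """Smallest number of components that together hold `mass` of the total weight."""
    total = sum(weights)
    running = 0.0
    for n, w in enumerate(sorted(weights, reverse=True), start=1):
        running += w
        if running >= mass * total:
            return n
    return len(weights)

for C in (10, 100, 1000):
    weights = draw_weights(gamma0=5.0, C=C)
    print(C, effective_count(weights))
```

Because the shape parameter shrinks as C grows, the expected total weight stays at gamma0/rate while most individual weights are pushed toward zero, so the number of components carrying almost all of the mass stays far below C for large truncation levels.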
R2 and R5 ask about temporal dynamics. We agree with R2 that the current model is limited in this regard and that this limitation prompts future work. We are actively working on a follow-up model that imposes a Markov model over the core tensor to promote temporal smoothness in community interactions. R5 asks whether we observed any temporal patterns in the regime factors. We did not, mainly because the model tended to concentrate its mass on one regime. However, we are looking into whether we can use the model to detect network anomalies over longer time spans of 20 years. Overall, while we focus mainly on the task of inferring latent network structure from snapshots, we agree that this paper prompts future work in modeling temporal dynamics.
R4 asks how many events there are in the data: the subsets we used range from 1 to 7 million events.