Paper ID: 1349
Title: Automatic Construction of Nonparametric Relational Regression Models for Multiple Time Series

=====
Review #1
=====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper describes an extension of the automatic Bayesian statistician of Lloyd et al. to address multiple datasets at the same time using Gaussian processes. The idea is to exploit a form of multi-task learning, in the sense that the different datasets are expected to share some part of the covariance function. At the same time, the method proposed by the authors considers a part of the covariance function that is specific to each dataset. The proposed method is evaluated on several datasets corresponding to stock data, the housing market, and currency exchange data.

Clarity - Justification:
The paper is clearly written in general.

Significance - Justification:
The idea proposed seems interesting, namely, to exploit some form of multi-task learning with the automatic Bayesian statistician. However, I believe it has some limitations. In particular, the authors have decided to use a spectral mixture kernel to describe the part of the covariance function that is specific to each dataset. This has the problem of deteriorating the interpretability of the results, which was one key aspect of the automatic Bayesian statistician. A weak point of the paper is the experiments. It seems that the authors have considered a single train/test split of the data for each dataset and hence do not include error bars in Table 1. This questions the significance of the results, since it is not possible to assess whether or not the results are statistically significant.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Some figures are not referred to in the text, e.g. Figures 2 and 6.
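To make the shared-plus-specific covariance structure described in the summary concrete: a minimal sketch, assuming an additive multi-task decomposition of the form k((x, i), (x', j)) = k_shared(x, x') + [i = j] * k_i(x, x'), where i and j index the individual series. All function and variable names here are illustrative, not the paper's implementation.

    import numpy as np

    def sq_exp(x1, x2, lengthscale=1.0, variance=1.0):
        # Squared-exponential (SE) kernel on 1-D inputs.
        d = x1[:, None] - x2[None, :]
        return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

    def relational_cov(x, series_id, k_shared, k_specific):
        # The shared kernel contributes to every pair of observations;
        # each series-specific kernel contributes only to pairs drawn
        # from the same individual series.
        K = k_shared(x, x)
        for i, k_i in k_specific.items():
            mask = (series_id[:, None] == i) & (series_id[None, :] == i)
            K = K + np.where(mask, k_i(x, x), 0.0)
        return K

    # Toy usage: two series stacked into one regression problem.
    x = np.linspace(0.0, 1.0, 6)
    ids = np.array([0, 0, 0, 1, 1, 1])
    k_specific = {0: lambda a, b: sq_exp(a, b, lengthscale=0.2),
                  1: lambda a, b: sq_exp(a, b, lengthscale=0.5)}
    K = relational_cov(x, ids, sq_exp, k_specific)  # (6, 6) PSD matrix

Under this decomposition, the shared component captures co-movement across all series while each specific component absorbs residual, per-series structure.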
=====
Review #2
=====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors propose an extension of a general kernel learning framework to handle structure that could be present across multiple datasets. They demonstrate the utility of this approach by finding shared structure across multiple financial time series.

Clarity - Justification:
This paper is very clearly laid out and written overall. The examples illustrate what's going on pretty clearly. I might suggest changing the title to say more explicitly that you're finding structure across multiple time series. I realize the method is more general, but this paper doesn't really go into that, so maybe it's better to be more explicit.

Significance - Justification:
This contribution is a novel and clever extension of the ABCD algorithm. It had previously been applied to single time series, as well as multivariate regression problems. However, it wasn't clear how often there would be interesting axis-aligned structure in general regression problems. The authors identified a new direction that avoids this problem, by realizing that in the case of multiple time series there might be non-obvious structure that is only apparent when looking at several time series together. I suspect this paper will get lots of attention from the financial industry. The experiments seem well done, and are on real datasets, which is nice. It would have been nice to include a really standard method in the NLL table, such as a GP with an SE kernel, or an ARMA model.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
There are a few ways in which this framework could be extended:
1. Instead of only adding the shared kernel, they could allow multiplication as well. I bet they could have found other kinds of shared structure (such as exponential growth in some periods) if they did this (and allowed an exponential kernel).
2. The authors inherit the use of BIC as a metric, as opposed to more sophisticated marginal likelihood estimates, such as a Laplace approximation. But, one step at a time.
3. It seems like using the spectral mixture kernel for the SRKL method caused the BIC numbers to be high. Did you check how many parameters in the spectral mixture were actually being used?

Very minor comments:
- It might be clearer if you don't say you introduce two new methods. Since one is a special case of the other, it might be easier to say you introduce one new method and investigate which parts of that method are most important.
- It would be nice to see the extrapolations made by the different algorithms, especially since they have very different RMSEs.
- You talk about the "search grammar getting deeper", but I think it'd be more accurate to say that the search gets deeper. The grammar stays fixed.
- It might be worth noting that the spectral mixture kernel is a special case of the ABCD grammar.
- I suggest making your citations blue; it's a little easier to read, imo.
- Typo on line 517: "tow" -> "two".
- The font on Figure 6 is pretty small.
- You don't need lines around the edge of Table 1.

=====
Review #3
=====

Summary of the paper (Summarize the main claims/contributions of the paper.):
Embedded within the automated statistician framework, the paper proposes two relational kernel learning methods for time series analysis. The approaches discover both a shared composite kernel, which explains the common causes of changes in multiple data sets, and individual components such as scale factors and distinctive kernels, which explain changes in individual data sets.

Clarity - Justification:
Overall, the paper is well written and structured. What I am missing are some qualitative examples of text generated by the automated statistician using the extended grammar for relational GPs. How do we write this in a human-understandable way? What does the human learn from this? To make this more specific, how does “CW(SE + CW(WN + SE, WN), CONST)” translate to a bank person?

Significance - Justification:
It touches on an important research question, and the experimental results indicate some improvements w.r.t. RMSE and BIC.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Nevertheless, I like the combination of the automated statistician with relational GPs. The experiments show improvements. The discussion of the learned kernels could have provided more details.

=====
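As context for the BIC discussion in Review #2 (points 2 and 3): ABCD-style searches score each candidate kernel by its optimized marginal likelihood penalized by the number of hyperparameters. A minimal sketch of that criterion follows; the function and variable names are illustrative, not the paper's code, and the spectral-mixture parameter count is an assumption based on the standard one-dimensional form of that kernel (roughly three hyperparameters per mixture component: weight, mean, and bandwidth).

    import numpy as np

    def bic_score(log_marginal_likelihood, num_hyperparams, num_data):
        # Bayesian Information Criterion (lower is better):
        #   BIC = -2 * log p(y | X, theta_hat) + |theta| * log n
        # A spectral mixture kernel with Q components contributes
        # roughly 3 * Q hyperparameters, so components that end up
        # unused still inflate the penalty term -- one possible
        # explanation for the high BIC numbers raised in point 3.
        return -2.0 * log_marginal_likelihood + num_hyperparams * np.log(num_data)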