Paper ID: 1345 Title: Correcting Forecasts with Multifactor Neural Attention Review #1 ===== Summary of the paper (Summarize the main claims/contributions of the paper.): The paper presents a neural network model to correct baseline time series forecasts for the influence of a number of explanatory variables (« forces » or « causal factors » in the supply chain forecasting jargon). The forces (which may be nonlinear transformations of the raw input variables) are combined additively to yield the final forecast. The model relies of a differentiable soft attention model that provides « explanations » in the forms of time-varying weights over the forces, which a human forecaster can meaningfully interpret. Experimental validation on a supply-chain dataset shows the benefit of the proposed approach against a number of widely-used alternatives. Clarity - Justification: The paper is well organized, well written, and easy to understand. Significance - Justification: The paper studies an important applied forecasting problem: how to make use of explanatory variables to meaningfully improve predictive accuracy in supply chain time series forecasting, where the time series are often quite short and causal impact is difficult to reliably establish. The proposed attention model is novel in this field, and appears effective in improving accuracy. Moreover, it answers an important criticism of « black box » forecasting methods in that the attention model can provide explanations to help human operators understand the predicted values. Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): - This is a good paper overall: it offers an effective and novel solution to an important problem in applied time series forecasting, and takes human sensibilities into account by providing a level of interpretability. I find that the experimental evaluation could be a bit more thorough: for example, it would be nice to include comparisons using some publicly available datasets, such as from the M (M-2, M-3) competitions. - The model studied by the authors assumes that the « Forces » have a contemporaneous impact with the time series under study, which is a very strong assumption (or assumes a rather low sampling frequency, e.g. weekly or monthly). There should be discussion about how lagged impacts could be taken into account. It would also be nice to include some discussion of what a unified model — handling both the time series component and the explanatory factors — could behave. - Lines 116-119: paper states that « There is no obvious way of extending univariate time series models to the multiple data-source setting ». It appears that the authors are unaware of simple and widely-used extensions of ARMA models, such as ARMAX, which are perfectly capable of making use of an arbitrary number of explanatory variables, so long as they contribute linearly to the forecast. Or do the authors mean something different? - Line 259: the phrase « not capable of reasoning that would not be interpretable » is awkward, and probably implies too much. - Line 284: the form of this equation is that of Generalized Additive Models, which should probably be discussed: - Hastie, T. J.; Tibshirani, R. J. (1990). Generalized Additive Models. Chapman & Hall/CRC - Line 354: why the output is range-limited through a tanh (which does not contribute a useful nonlinearity in the output stage), if ultimately we are training a (gated) regression model? - Line 418: the objective function minimizes the quadratic forecasting error, which implies a Gaussian predictive distribution. For many supply-chain forecasting problems, this is not appropriate, since the observed time-series values are non-negative and the distribution is right-skewed. Why was this choice made? - Line 507: « binomial » ==> would « Bernoulli » be a better choice? - Line 522-523: When using dropout, the model weights are divided by 2 at test-time (assuming dropping probability of 0.5). Is a similar adjustment made here? - Line 706: is the reference to section 5.1 correct? - Line 802: looses ==> loses ===== Review #2 ===== Summary of the paper (Summarize the main claims/contributions of the paper.): The paper proposes a model for forecasting demands of a retailer company on multiple commodities. The forecasting model proposed is based on a neural network with soft attention over multiple influential "forces" (e.g., weather, social network mentions etc) and their observations. Each force and observation are modeled to have an additive effect over a ``baseline" prediction. Sparsity penalty on attention is used. Soft attention allows also the interpretability of the forecasting results. The model outperforms model based forecasting systems and a non-attentive neural network. Clarity - Justification: The paper is generally clear and easy to understand. Significance - Justification: Forecasting retailer demands is an important problem and the model seems to perform quite well. Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): 1) Why the baseline prediction is used? Why the network cannot learn to predict from scratch? Is it a limitation of training data? 2) It is suspicious that the non attentive neural net does so much worse.. any explanation? I understand attention allows result interpretability but I wouldn't expect without attention things would not work. ===== Review #3 ===== Summary of the paper (Summarize the main claims/contributions of the paper.): The paper is a nice demonstration of the success of a hierarchical mixture of experts model to demand forecasting, achieving improved performance over baselines. The authors have not selected bad straw-men baselines and the improvements are substantial. The network is able to learn to incorporate diverse features, with each feature contributing to the final accuracy. Clarity - Justification: The paper is easy to understand, but there may be some confusion initially due to the terminology. The author use the word forces, where I would prefer features or 'groups of features'. It was also hard for me to decipher what was the actual final model used since the description is split across multiple models in section 3. I am not sure if the network was trained end-to-end or each layer/network separately. Significance - Justification: The only problem I have is that the techniques and model used are not particularly novel. In particular mixture of experts [1] and hierarchical mixture of experts [2]. The authors may not be familiar with this line of work, since it precedes the current deep learning literature, but as far as I can tell the models are the same. The authors have indeed used domain knowledge to define experts and where in the hierarchy experts are used, but I am not sure that this is sufficient novelty. I am not convinced that these models are particularly more interpretable than 'normal' NNs or hierarchical expert models. [1] Jacobs, Robert A., et al. "Adaptive mixtures of local experts." Neural computation 3.1 (1991): 79-87. [2] Jordan, Michael I., and Robert A. Jacobs. "Hierarchical mixtures of experts and the EM algorithm." Neural computation 6.2 (1994): 181-214. Detailed comments. (Explain the basis for your ratings while providing constructive feedback.): I would strongly recommend a change of language to features and groups of features rather than forces. This will make it much more understandable to the ML community. The authors should review work on mixture of experts, and gating networks in particular and differentiate this work from the work there. Soft gating is possibly equivalent to soft neural attention. I would also recommend a diagram of the actual NN finally used - possibly instead of figure 1. Figure 1 is good because it shows what a real user sees, and shows the interpretability of the method. However, more labeling would be helpful. The line at the bottom shows that the biggest factor in each forecast is the level of the series at the previous observation (line 489). This is common to many forecasts, but still a bit disappointing and should be pointed out. I personally found that the bottom up description in the paper from simple to complex obscured the actual final model used in the experiments. This bottom up description leads to wasted space : - 2 descriptions of hierarchical models. in Section 3.2 and 3.5 - 2 descriptions of regularization considered in Section 3.6 and (10), I am slightly confused by the || context() notation. It may be better to say W = some function (x) like in (4)-(7) ? Sections 1 and 2 are excellent surveys, with an impressively deep set of references. 37: HW is better called exponential smoothing and modern references are due to Hyndman, including https://www.otexts.org/fpp 117: Hyndman shows how to combine ES and ARIMA with multivariate regression. See also http://link.springer.com/chapter/10.1007%2F978-3-662-44848-9_24#page-1 192: Don't cite Bollen. His paper has been comprehensively debunked; see http://sellthenews.tumblr.com/post/21067996377/noitdoesnot - I am not sure the comments in 117-118 on time series forecasting are accurate, especially given that the simple baselines used in this paper are examples of what I would consider multi-variate time series methods. - Relatedly, in 651-656 is the comment that flattening a matrix is non-trivial ? Or that flattening looses structure information in an undesirable way ? - I am not sure what is the distinction being made in lines 201-210. Is this a comment on time series stationarity ? - I would also remove most of 212 - 240. While possibly useful background, they are not particularly relevant to the ML problem. They may also be the result of previous review attacking baselines. - I am not convinced that these models are particularly more interpretable than 'normal' NNs or hierarchical expert modelsIf there is information about demand in two correlated features then there will be many combinations of weights in for example (8) that lead to similar predictions. A source of such correlation would be the auto-correlation of time series used in features. For example, historical demand and temperature are good predictors of current demand and current temperature - all of which are correlated. More concretely, I would argue that the weights in (13)-(17) and elsewhere are not interpretable in any real sense. There are probably too many ways to trade off forecasts in (7) with weights in (6). - On another note, a reasonable amount of the feature engineering work has been replaced by a lot of NN structure engineering. Could multiplicative relationships like (9) be computed with log transforms of features and exponential linear units ? Can all the inputs [including those in 3.5] be given as inputs at the same level, followed by a hierarchical network ? - I would recommend fleshing out 4.3 Training details, since this is very important in getting NNs to work. For example, was PCA still required ? - 783 - 789 seems to be an interesting comment, but possibly could be emphasized and described in more detail ? Lines 245 to 264 are overstated. Line 666 shows that the authors have a vector of 3139 features. Equation (4) says that each of these features is a separate neural net input--not  a new concept :-) 277: It is not clear what the real meaning is of the words observation, force, and hierarchically. As far as I can see, Equation (2) simply says there is a linear model on top of feature values. Equation (3) says that each "observation" is actually a vector of concatenated feature values. See comment on 632. 609: It seems very expensive computationally to retrain the whole baseline model at each timer tau.  624: There is not nearly enough information to make the baseline algorithm reproducible. The Fourier method doesn't even have a citation. How are P and each tau_p estimated? How can seasonality be estimated without either magic or at least two years of data (line 578)? 632: Be more explicit about the difference between forces and observations. Point out that Equations 4 to 7 have separate weigh matrices for each force f, but not for each observation i. 655: "singular" should be "single.” Other misused words elsewhere: showcase, looses 666 to 670: Give much more detail about the data. How many data points total? What are the components of the number 3139? How did you handle missing data? Was there a separate model for each commodity, or one for all?  How many parameters in the neural net? What regularization strength and other hyper parameter values were chosen? Discuss the intuitive meaning of these: heavy or light regularization? Did you use a GPU? 775: The baseline forecast is one of 3140 inputs. I t may not be surprising that lasso and other methods did not find this needle in a haystack. 5.2: I'd like to see more investigation that convinces the reader the success of the approach is capturing real signal in the data. Show an example of a commodity where weather has an intuitively correct influence. Retrain the model without the events data, since their benefit is slight and they introduce lots of additional parameters.  Explain why the social signal is useful: what are the commodities? why should a national signal be relevant to a  local store? It is well-known that Twitter is not representative of the general public; it skews towards younger, minority users and celebrity chatter. SO why is it useful here? 859: Not trivial! =====