We want to start by sincerely thanking the reviewers for their thoughtful comments and suggestions. Unfortunately, we do not have enough space to address many of the comments here, but we will make our best effort to do so in the final paper.

Based on reviewer feedback about a unified model that also performs the time series prediction described in Section 6, we have conducted additional experiments to validate our claims. We will provide detailed explanations of the model and its results in the final version of the paper, as we are space-limited here. In short, we replace the baseline input component with a GRU RNN over the time series and train the model end-to-end, so that the GRU learns to integrate with the other data sources through the soft attention mechanism. In these experiments, we see performance improvements relative to using the baseline as an input.

In our work, “forces” represent a type of feature. Our nine “forces” were the weather description, temperature, wind speed, visibility, relative humidity, UV description, local events, national events, and the commodity social signal. For each force at each time step, multiple “observations” are provided (the number can be dynamic). An “observation” is a single reading at the granularity of our raw measurements, concatenated with an appropriate context vector as in Equation 3 and Section 3.2.

Our model handles the lagged impact of variables by allowing the model developer to specify which “observations” to send to the network at each time step. For example, if the temperature from past weeks within a window W is expected to possibly impact the forecast, all temperature “observations” from the past W weeks can be sent to the network at each time step. For large windows, appropriate context vectors must be incorporated into the “observations” (such as positional encodings).

In our experiments, attention-based neural networks perform significantly better than standard neural networks. However, the bulk of these gains came on the five most volatile commodities (Table 2). These commodities are “noisy” in the sense that their sales are highly volatile, training data is scarce, and there are thousands of possible explanatory features to consider. Our intuition is that attention-based neural networks should help combat this noisy-data problem, especially with the imposed sparsity, which should push many attention values near zero early in training. The sparse attention mechanism forces entire observation vectors to have zero influence on the prediction, effectively shrinking the number of explanatory variables the model considers at that point. At times, a small number of values in an “observation” vector may by chance correlate highly with the volatility of the signal over a short period, and this becomes more probable as volatility increases. The attention mechanism makes a holistic judgment over a group of features, allowing it to dismiss the entire group and shield the model from reacting to spurious correlations in a small subset of the observation vector. Our experiments seem to support this hypothesis; a more rigorous theoretical analysis of the properties of this model is left to future work.
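For concreteness, the following minimal sketch illustrates the observation construction and the sparse attention weighting described above. The window size, the sinusoidal context vector, and all variable names are illustrative assumptions rather than our exact Equation 3 or training setup.

    import torch

    W = 8                                       # assumed lag window, in weeks
    raw = torch.randn(W, 1)                     # one raw temperature reading per past week
    lag = torch.arange(W, dtype=torch.float32).unsqueeze(1)
    context = torch.cat([torch.sin(lag / W), torch.cos(lag / W)], dim=1)  # toy positional encoding
    observations = torch.cat([raw, context], dim=1)     # each row: raw reading + context vector

    # Attention is taken over whole observation vectors; an L1 penalty on the
    # attention weights is what pushes many of them toward zero during training.
    attention = torch.rand(W, requires_grad=True)       # stand-in for learned attention values
    sparsity_penalty = attention.abs().sum()             # would be added to the loss, scaled by a hyperparameter
    weighted = attention.unsqueeze(1) * observations     # a zero weight removes the entire observation

When an attention weight is driven to (or effectively to) zero, every entry of the corresponding observation vector is removed from the prediction, which is the holistic shrinkage effect we describe above.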
Our work is related to a possible incarnation of a one-level Hierarchical Mixture of Experts (HME) model in which the expert networks are each learned over a different group of features, explicitly specified when the model is instantiated. While there are many implementation differences, the most significant architectural difference lies in the weighting component: the HME gating network versus our proposed soft attention mechanism. Our soft attention mechanism learns attention weights from a classifier on top of the hidden representations, rather than basing them on the input representation as the analogous HME gating network does. Our experiments show that the same setup trained like a gating network, where attention units are based on the input representation, achieves 24.98/6.74 MAPE/Anomaly% over the 20 commodities and 38.29/12.95 over the most volatile five. We also validate empirically that our proposed sparse attention regularization can add value to the incarnation of our model that leverages a gating network, improving performance over the full set to 24.01/6.23; the bulk of this improvement came on the five most volatile commodities, where performance was 35.89/11.69.

Our intuition is that using the hidden representation should be more powerful, since it has more learnable parameters, and more generalizable, because our hidden layers tend to be small relative to the input feature size. Additionally, because the hidden-layer weights are shared between the attention score (m_if) and the impact (y_if), the representation is biased toward also solving for the attention score, which may improve generalization. These results also seem to suggest that our proposed sparse attention paradigm can improve certain incarnations of HME models when the data is “noisy” (as defined above).
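To make this architectural contrast concrete, here is a minimal sketch of the two variants. The layer sizes, the sigmoid scoring, and the additive combination are simplifying assumptions and not our exact formulation from Section 3.2.

    import torch
    import torch.nn as nn

    class ObservationAttention(nn.Module):
        """Per-observation attention score m_if and impact y_if (illustrative)."""
        def __init__(self, obs_dim, hidden_dim, score_from_hidden=True):
            super().__init__()
            self.hidden = nn.Linear(obs_dim, hidden_dim)        # shared hidden layer
            self.impact = nn.Linear(hidden_dim, 1)              # produces the impact y_if
            self.score_from_hidden = score_from_hidden
            score_in = hidden_dim if score_from_hidden else obs_dim
            self.score = nn.Linear(score_in, 1)                 # produces the attention score m_if

        def forward(self, obs):                                  # obs: (num_obs, obs_dim)
            h = torch.relu(self.hidden(obs))
            y = self.impact(h).squeeze(-1)
            score_input = h if self.score_from_hidden else obs   # ours vs. gating-network style
            m = torch.sigmoid(self.score(score_input)).squeeze(-1)
            prediction = (m * y).sum()
            sparsity_penalty = m.abs().sum()                     # sparse attention regularizer term
            return prediction, m, sparsity_penalty

With score_from_hidden=True, the attention score reuses (and therefore shapes) the same hidden representation as the impact, which is the sharing effect we argue may improve generalization; with score_from_hidden=False, the score is computed directly from the input, mirroring the HME gating network.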