Paper ID: 599
Title: Factored Temporal Sigmoid Belief Networks for Sequence Learning

Review #1
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors propose an extension of temporal sigmoid belief networks (TSBN; Gan et al., 2015) that uses a particular factorisation of the network weights to introduce conditioning. The key change is equation (11). The authors then use NVIL for inference. Results show that the deep variant of the model performs better than a number of other models. There are not many deep temporal generative models, and this weight parameterisation appears to improve upon the TSBN.

Clarity - Justification:
The text was easy to follow and the model well explained. There appears to be sufficient information in the experiments for reproduction (learning rates, etc.).

Significance - Justification:
Conditioning upon side-channel information in generative models is an interesting problem, and the authors show that a straightforward means of doing this works on some low-dimensional problems.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
It would be great to see:
* performance on higher-dimensional data sets;
* videos from the mocap tasks.
For example, a comparison with "Action-Conditional Video Prediction using Deep Networks in Atari Games" (Oh et al., 2015) would be extremely interesting, where y_t is the action.

=====
Review #2
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes an extension of Temporal Sigmoid Belief Networks whereby side information can be incorporated by conditioning. The paper proposes a factored form of the model which reduces the parameterization from N^3 to N^2. Similar to previous work on conditional, factored generative models, the authors apply the technique to generating "styled" mocap data. The paper also proposes a semi-supervised variant which can be used for sequence classification. Experiments on weather prediction with side information and on text generation using word embeddings are also provided.

Clarity - Justification:
The paper is very clear in its aims, its contributions, and the technical description of the model. I have a few small suggestions to improve clarity:
- In Figure 1, label \hat{W} as \hat{W}_1, as this is a specific instance of the "hidden-hidden" weights, not a general diagram (as noted in the footnote, the other weights are similarly defined, which is OK).
- The biases b^{(y)} and c^{(y)} as defined after Equation 10 do not appear to have an offset component. Does this mean that a 1 is added to the y_t vector so that there is an affine component?
- One difference between the conditional RBM-style models and the TSBN is that the TSBN also conditions the log variance on the previous states; it is therefore more heavily parameterized and dynamic. This is an interesting feature and could perhaps be discussed (either in developing the model or in the experiments). For example, in the motion capture experiments, are the variances indeed changing from frame to frame, or from style to style?
- Define the diag() operator when it is first used.

Significance - Justification:
At first glance, this paper may be seen just as an extension of the (Gan et al., 2015b) TSBN with conditioning information, similar to (Taylor and Hinton, 2009).
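(For concreteness, here is a minimal sketch of the factored three-way conditioning used in this line of work, written in the Taylor & Hinton (2009) style and not necessarily matching the paper's Equation 11 exactly: the conditioned hidden-to-hidden weights can be written as

    \hat{W}_1 = W_a \, diag(W_b y_t) \, W_c,

where y_t is the side information (e.g. the style label) and, for F factors and N hidden units, W_a is N x F, W_b is F x dim(y_t), and W_c is F x N; the symbols N and F are used here purely for illustration. The label gates the hidden-to-hidden transition through the diagonal term, so a full N x N x dim(y_t) three-way tensor is replaced by three two-way matrices.)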
However, the factoring taken in this approach is different from that in previous work, and is quite sensible in that the weight-sharing pattern of the factored W matrix is well motivated: some components are explicitly shared across styles while others are style-specific. The method is also demonstrated not just on mocap data but on weather and text data. Also, it is evaluated both generatively (predictively on both mocap and weather) and as a classification technique. One aspect missing from the experiments (though they are strong) is a direct comparison to other temporal variants of directed models trained with stochastic variational inference (e.g. the many recurrent variants of variational autoencoders that have been proposed within the last 1-2 years). Also commendable: the paper is really clear about its positioning relative to related work (Section 4 is extensive) and provides technical details of the implementation in the supplementary material.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
- A comment to the authors: have you considered a way to make predictions that does not involve sampling? For example, in mocap generation, sampling may be desirable and can explore different modes of the predictive distribution, but in weather modeling one may want a "single best estimate".
- I believe the reference for the g(Gaussian)-RTRBM is incorrect; it should be Sutskever et al., 2009.
- In the paragraph before 5.4, the paper describes a "Hidden Markov FCTSBN", but this terminology isn't used earlier... in a sense, since the FCTSBN generalizes HMMs, aren't all the models considered "Hidden Markov" FCTSBNs?
- I didn't understand the statement "The entire transition process happens over around 60 frames, which is much faster than the 200-frame transition proposed for FCRBM" -- isn't the length of the transition just controlled by the experimenter who sets the style inputs? In other words, I don't see this as an advantage of the model, just a change in how the experiments were run.
- Typo: "classifcation" in Section 5.3.

=====
Review #3
=====
Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper extends temporal sigmoid belief networks (TSBNs), as proposed in Gan et al. (2015; directed time-series models with one or more layers of latent binary variables), to jointly model a distribution over time series and side information (e.g. labels). The model employs a factorization of the weight matrices, as in e.g. Memisevic & Hinton (2007), to compactly parameterize the influence of the label on the time-series dynamics, and the authors provide schemes for unsupervised learning and for discriminative learning from labeled and partially labeled data (similar to Kingma et al., 2014). The model is empirically evaluated primarily on motion capture data, but also on a weather dataset and on text data, and it is shown to outperform RBM-based time-series models on the first.

Clarity - Justification:
The paper is clearly written.

Significance - Justification:
The ideas in the paper seem to be a relatively straightforward combination of TSBNs (and the associated scheme for amortized variational inference; Gan et al., 2015), ideas for semi-supervised learning with generative models and amortized variational inference (as in Kingma et al., 2014), and the commonly applied factorization of three-way weight tensors (e.g. in Memisevic & Hinton, 2007; Taylor & Hinton, 2009). The originality is therefore somewhat limited.
Nevertheless, the paper may be useful for researchers or practitioners interested in modeling time-series data with side information.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
As pointed out above, the paper is a relatively straightforward combination of known "ingredients", so I would not rate it particularly high in terms of originality. It could be a useful demonstration for researchers or practitioners interested in modeling time series with side information. The paper is clearly written.

The experimental evaluation is relatively extensive: it includes several datasets (motion capture, weather, text), and the authors consider several scenarios (unsupervised and semi-supervised discriminative learning tasks) illustrating how the proposed models can be employed. Overall, the results seem to suggest that the conditional TSBN (CTSBN) models indeed outperform standard TSBN models (although see the additional comments below), and that factorization of the weight matrices helps. Unfortunately, non-CTSBN comparisons are provided only for the MoCap data and not for the remaining two datasets, so it is somewhat hard to judge the quality of the learned models for those datasets.

One question the paper does not really address is whether binary latent variables are an appropriate modeling choice in the first place. The only comparison is with RBM-based time-series models; continuous latent-variable models are not considered.

I have to admit I am somewhat surprised that the gradient estimates for the inference network do not suffer from high variance, given the number of latent variables replicated over many time steps, especially for the deep models. (NB: how long are the time-series snippets that you usually train on?)

It also seems to me that for experiments involving generation from the learned models, the authors regularly consider relatively high-order model variants (e.g. n=12 in Section 5.2; n=6 in Section 5.3). This makes me wonder a little bit to what extent the learned latent states really capture unobserved information rather than just providing an encoding of the current observation.

Additional comments:
- Table 1: are the best-performing models actually significantly better than the rest? In several cases the std-devs seem relatively large compared to the differences.
- Table 1: the difference between CTSBN and TSBN seems to more or less go away for deep models. Maybe the "shallow" TSBN simply hasn't got enough capacity to model both running AND walking well? (I assume you consider a single TSBN model that has to model both types of sequences.) Have you tried increasing the capacity of the TSBN?
- Sections 5.2 / 5.3: you seem to be training different FCTSBN models for generation, prediction, and semi-supervised learning (order 12 for generation; order 8 for prediction; order 6 for semi-supervised learning). Why is this? (Or am I misreading the paper?)
- Have you considered non-multiplicative interactions between the labels and the other model variables?
- I am confused by Algorithm 1: except for some small changes in notation, this seems to be exactly the algorithm in Mnih et al. for a non-temporal model. I think you should make explicit the changes needed to adapt this algorithm to time-series models.
- The "explaining away" effect surely goes back further than Hinton et al., 2006, and I don't think it is right to say that variational models "avoid" this effect.
  They learn an approximate posterior that takes the effect into account to the extent permitted by the approximation.
- l. 132: what do you mean by "uniformly sampled" in this context?
- You seem to be overloading sigma (std-dev and sigmoid function).
- Eq. 12: \pi is undefined here.

=====