We thank the reviewers for their valuable comments. We first explain our factoring method and then respond to individual comments.

Although we use the same term "factoring", the statistical intuition behind our factoring scheme and its details differ from those in (Taylor et al., 2009). Factoring in FCRBM is based on the assumption that the weight tensor in the energy function is low-rank, so the tensor is approximated through a tensor rank decomposition. In our case, the transition parameters rest on an additional assumption: sequences with different styles share underlying structural similarities. We capture these by explicitly introducing W_a and W_c, which are shared among styles, while W_b models the differences across styles (see Eqn. (11); an illustrative schematic follows below). As demonstrated in the experiments, this factoring allows for better generalization.
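Purely as an illustrative sketch of the structure just described (not a reproduction of Eqn. (11), whose exact form, indexing, and additional terms are given in the paper), the shared/style-specific split can be pictured as

  W^{(y)} = W_a diag(W_b y) W_c,

where W_a and W_c are shared across all styles and the style label y enters only through the factor activations W_b y, so that W_b is the only style-specific component.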
For reviewer 1:
- We will fix the notation in Fig. 1, add a definition of diag(), and fix the citation for g-RTRBM and the typo.
- The offsets of the biases b^{(y)} and c^{(y)} are omitted for simplicity; we will clarify this.
- We will add a discussion of modeling the log variance as in Eqn. (10); the variances change from frame to frame.
- During the prediction phase, we use the expectation of the sampled variables to form the estimate.
- In line 519 we mention that the FCTSBN without transition parameters between observations is denoted "Hidden Markov FCTSBN".
- The transition length is controlled by us, but in practice "a quick (frame-to-frame) change of labels will simply produce a 'jerky' transition" (Taylor et al., 2009), so we used fewer frames during transitions to show our model's robustness relative to FCRBM. We will clarify this in the final draft.

For reviewer 2:
- We provide comparisons with other models only on the mocap dataset because it is widely studied. We will add results using FCRBM to Table 2.
- We believe binary latent variables are an appropriate choice, since RBM-related models have shown promising performance on a variety of tasks. Comparing with continuous latent-variable models is an interesting direction.
- The high-variance issue is alleviated by the variance-reduction method introduced by (Mnih et al., 2014), which we found effective; this is also how (Gan et al., 2015) trained the TSBN model.
- The sequences we train on are time series of thousands of frames, split into mini-batches/snippets of 25 frames. In the mocap10 experiments, we found that our generations differ greatly from the training data; moreover, we can combine and switch styles smoothly, which is not seen in the training set. The results are included in the supplementary material.
- Indeed, there are runs in which the best-performing models do worse, but there are also runs in which they do much better, so we believe averaging the results is reasonable. The variance is high partly because of the small dataset size and the sampling process. A larger performance gap between FCTSBN and the other models can be seen on the mocap10 datasets (supplemental C.2).
- We tried increasing the capacity of the TSBN model, but the prediction results were not significantly better. We will add this to the final draft.
- The choice of the order is briefly discussed in lines 526-533. For generation, we use an order of 12 for a relatively fair comparison with FCRBM, which uses that order. For prediction, we found that an order of 8 already gives good performance. For semi-supervised learning, a smaller order better demonstrates the performance gap on the small training set, and an order of 6 was found to achieve this; we will discuss these choices in a footnote.
- Non-multiplicative interactions could be interesting; we did not consider them because we believe multiplicative interactions perform better.
- Alg. 1 is the same as in (Mnih et al., 2014). The minor differences are: (i) the gradient calculation is different; (ii) the summation in the original NVIL is over a mini-batch of individual data points, whereas the summation in Alg. 1 (lines 68 and 78) is over a mini-batch of consecutive time steps (a small illustrative sketch is given at the end of this response). We will clarify this.
- We mention "explaining away" to point out that it makes inference hard in directed belief nets; our approach avoids the inference difficulty that arises from "explaining away". We will clarify this in the draft.
- We uniformly sample the continuous-time series into a discrete-time series and then model the sampled data with FCTSBN; e.g., the mocap data was uniformly sampled at 120 Hz.
- \sigma without a subscript denotes the sigmoid function, while the subscripted \sigma's are standard-deviation parameters. We will clarify this in the draft.
- \pi is the parameter of the prior. We will clarify this in the draft.
- We will also cite (Memisevic & Hinton, 2007).

For reviewer 3:
- We provided videos from the mocap tasks in the supplementary material.
- We will cite (Oh et al., 2015) and leave the comparison as interesting future work.
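Finally, as referenced in the Alg. 1 item above, the following is a minimal, purely illustrative sketch of difference (ii): the baseline-corrected learning signal is averaged over the consecutive time steps of a snippet rather than over independent data points. The function and variable names (and the within-snippet standardization) are placeholders for exposition, not the exact quantities or notation of Alg. 1 or of (Mnih et al., 2014).

import numpy as np

def snippet_learning_signal(log_p, log_q, baseline):
    """Baseline-corrected learning signal over one snippet (illustrative).

    log_p, log_q, and baseline are length-T arrays holding, for each of the
    T consecutive time steps, the generative log-probability term, the
    recognition log-probability term, and an input-dependent baseline,
    in the spirit of NVIL (Mnih et al., 2014).
    """
    signal = log_p - log_q - baseline
    # NVIL also divides by a running standard-deviation estimate; here we
    # simply standardize within the snippet for illustration.
    return (signal - signal.mean()) / max(float(signal.std()), 1.0)

# Toy snippet of T = 25 consecutive frames (the snippet length we use),
# filled with placeholder values.
T = 25
rng = np.random.default_rng(0)
log_p = rng.normal(size=T)
log_q = rng.normal(size=T)
baseline = rng.normal(size=T)

signal = snippet_learning_signal(log_p, log_q, baseline)
# Difference (ii): the estimate averages over the T consecutive time steps
# of the snippet, not over a mini-batch of independent data points.
print(signal.mean())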