We thank the reviewers for their helpful feedback. Regarding typos and table/figure ordering (raised by Reviewers 1 and 3): thanks for the suggestions; we will apply those fixes.

R1: General comments: Your points regarding the comparison of the methods' scalability and polishing the experiments section are well taken. In the camera-ready version, we will add an experiment comparing the scalability of our method with that of the baselines. Furthermore, in addition to making the experiments section clearer, we will add more details about the experiments to our supplementary material. We also plan to publish code implementing the model/algorithm.

R1: Regarding ground truth for the Fruit fly and the Users datasets: the Fruit fly dataset provided in (Kain et al. 2013) and the Users dataset are both accompanied by ‘gold standard’ labels, so the ground truth is available for both datasets.

R1: The maximum number of states in Table 1 is the number of possible labels for each dataset. For instance, in the Users dataset there are 23 possible tutorials (i.e., 23 labels) that the users have followed. For this dataset, the maximum number of states that the SVI algorithm is truncated to is chosen from K = 80 or K = 60 (mentioned in Section 5.2) based on the variational lower bound; a brief sketch of this selection rule appears at the end of this response. Apologies for the confusion; we will change the column labels in Table 1 and make the text in the experiments section clearer.

R1: Regarding the observations in the Users data: that is correct; the 59 actions are read directly from the users’ log files, and the action at time t is the observation at that time point.

R3: Regarding the computational burden of the feature-based model compared to the feature-independent one: the difference lies in optimizing the feature weights. The gradient computation (the last equation in Section 3.2) has time complexity O(TK), where T is the length of a sequence and K is the number of latent states; a sketch illustrating this loop structure appears at the end of this response. We believe that in settings with rich domain knowledge about segmentation features, adding features could be more useful. For the Sensors data, we simply use a raw observation as an added feature to the model; however, this very basic feature does not appear to be very helpful in this experiment.

R3: Re the 0.5 threshold: the threshold of 0.5 is just a natural default; it can be tuned to trade off sensitivity against specificity (see the sketch at the end of this response).

R3: Re the second column of 5.1: the estimate of the uncertainty over a change point at time t is in fact the marginal probability of beginning a segment at time t, i.e., p(s_t = 1 | Y). We will rephrase the sentence to make this point clear.
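For concreteness, a minimal sketch of the ELBO-based choice of truncation level mentioned above. This is an illustration under our assumptions, not the exact code from the paper; the elbos mapping is a hypothetical placeholder for the fitted lower bounds.

```python
def select_truncation(elbos):
    """Pick the truncation level K whose fitted model attains the
    highest variational lower bound (ELBO).

    elbos: dict mapping a candidate K to the final ELBO reached by SVI.
    """
    return max(elbos, key=elbos.get)

# Hypothetical usage (ELBO values made up for illustration):
# select_truncation({60: -1.52e4, 80: -1.49e4})  ->  80
```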
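The following sketch illustrates the O(TK) loop structure of the feature-weight gradient discussed in the response to R3. It is not the exact equation from Section 3.2; the arrays features and posteriors are hypothetical placeholders for the segmentation features and the posterior state marginals.

```python
import numpy as np

def feature_weight_gradient(features, posteriors):
    """Illustrative gradient accumulation over T time steps and K
    latent states; the double loop makes the O(TK) cost explicit.

    features:   (T, D) array of segmentation features at each time step
    posteriors: (T, K) array of posterior state marginals q(z_t = k)
    Returns a (K, D) array: one gradient contribution per state.
    """
    T, D = features.shape
    _, K = posteriors.shape
    grad = np.zeros((K, D))
    for t in range(T):        # T time steps ...
        for k in range(K):    # ... times K states -> O(TK) updates
            grad[k] += posteriors[t, k] * features[t]
    return grad
```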
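Finally, a minimal sketch of the change-point thresholding discussed above: time t is flagged as a change point when the posterior marginal p(s_t = 1 | Y) exceeds the threshold. The function name and inputs are hypothetical; only the 0.5 default comes from the paper.

```python
import numpy as np

def detect_change_points(p_change, threshold=0.5):
    """Flag time t as a change point when p(s_t = 1 | Y) > threshold.
    Lowering the threshold increases sensitivity at the cost of
    specificity, and vice versa.

    p_change: length-T array of marginal probabilities p(s_t = 1 | Y).
    """
    return np.flatnonzero(np.asarray(p_change) > threshold)

# Hypothetical usage:
# detect_change_points([0.1, 0.7, 0.4, 0.9])  ->  array([1, 3])
```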