Paper ID: 1205
Title: The Segmented iHMM: A Simple, Efficient Hierarchical Infinite HMM

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
In their paper the authors propose a hidden Markov model (HMM) for learning two-level dynamics in a nonparametric way. Low-level sequences are described by an infinite hidden Markov model, while the high-level dynamics are modeled as a change-point process. At each change point the previous low-level state is ignored and a new segment of the low-level dynamics starts with a state drawn from the distribution of initial states. Probability distributions for transitions and initial states are drawn from a Dirichlet process. Stochastic Variational Inference (SVI) is used to analyze three example data sets with two-level dynamics, where the upper level corresponds to jumps between activities and the lower level describes action sequences within the activities. Results indicate that the proposed model is more efficient than both simpler (infinite HMM) and more complex (two-level hierarchical HMM) alternatives.

Clarity - Justification:
The paper contains a clear description of the proposed model and discusses the differences from similar models in detail. Additionally, the authors explain why the model is especially suitable for data sets like those used in the experiments and how SVI works with this model. Thus all information needed to apply the proposed algorithm to similar data sets is available.

Significance - Justification:
The proposed model is a combination of an infinite HMM with a change-point process. As this leads to a two-level model especially suitable for efficient inference, it is an important advance in the area of nonparametric HMMs.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The paper is well written. It clearly describes the proposed model as well as a practical, efficient inference algorithm. I found only two typos which should be fixed in the final version of the paper: in line 322 the logarithm is missing in the expectation value, and in line 409 s(t-1) = 1 should be s(t-1) = 0.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
In this paper, the authors propose the segmented iHMM (siHMM), which can be applied to the segmentation problem: given a time series, the goal is to identify the time stamps at which the series transitions from a stable regime to a new one. In particular, this model extends the iHMM of Beal et al. by introducing a new binary latent feature that depends on the current state of the HMM and indicates whether the time series is changing from one regime to a different one, i.e., a new segment starts. Additionally, the authors derive two versions of the siHMM: the first, in which the new-segment latent variable depends only on the latent state of the Markov chain, and the second, in which it also depends on the observation at the current time. The latter model allows prior domain knowledge to be included in the model. For the proposed model, they derive a stochastic variational inference algorithm that scales linearly in the number of observations in each mini-batch (T) and quadratically in the number of states (K), which is truncated to a given value during inference. The proposed model is simple and, in contrast with existing approaches based on hierarchical HMMs, allows for efficient inference.
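To make the generative process just summarized concrete, here is a minimal sketch of the feature-independent variant (the finite number of states, the 1-D Gaussian emissions, and all parameter names are illustrative assumptions; the paper instead places Dirichlet-process priors on the initial-state and transition distributions and truncates only for inference):

```python
import numpy as np

def sample_sihmm(T, pi0, trans, seg_prob, emit_means, seed=0):
    """Minimal generative sketch of a (feature-independent) segmented HMM:
    a binary switch s_t, drawn conditionally on the current state, decides
    whether the next state follows the transition matrix (same segment)
    or is redrawn from the initial-state distribution pi0 (new segment).
    Finite K and 1-D Gaussian emissions are simplifying assumptions.
    """
    rng = np.random.default_rng(seed)
    K = len(pi0)
    z = np.empty(T, dtype=int)   # latent low-level states
    s = np.zeros(T, dtype=int)   # change-point indicators
    y = np.empty(T)              # observations
    z[0] = rng.choice(K, p=pi0)
    for t in range(T):
        y[t] = rng.normal(emit_means[z[t]], 1.0)
        s[t] = rng.random() < seg_prob[z[t]]  # switch depends on current state
        if t + 1 < T:
            # New segment: the previous state is ignored, restart from pi0.
            nxt = pi0 if s[t] else trans[z[t]]
            z[t + 1] = rng.choice(K, p=nxt)
    return y, z, s

# Toy usage: K = 3 states; state 2 is likely to end a segment.
pi0 = np.array([0.5, 0.3, 0.2])
trans = np.array([[0.8, 0.1, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.1, 0.1, 0.8]])
y, z, s = sample_sihmm(200, pi0, trans,
                       seg_prob=np.array([0.02, 0.02, 0.3]),
                       emit_means=np.array([-2.0, 0.0, 2.0]))
```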
However, although the experimental results show that the proposed model provides slightly more accurate results on the segmentation task than the standard iHMM and the two-level nonparametric version of the HHMM in (Johnson, 2014), the authors do not provide any experimental (or theoretical) comparison of the scalability or mixing properties of the proposed model (and inference algorithm) with competitors. Moreover, although the description of the proposed model and inference algorithm is quite clear, the description of the experiments can be further improved. Please find detailed comments below.

Clarity - Justification:
As pointed out before, the paper is in general well written. However, some details of the experiments are not clear. For example, is the K-means algorithm used in the Users and Fruit fly behavior applications to obtain the "ground truth"? In that case, the reported errors measure how similarly the proposed algorithm behaves with respect to K-means. Additionally, is the maximum number of states in Table 1 the maximum number of states that the SVI algorithm is truncated to? If so, how are these values chosen, and why not exploit the BNP properties of the siHMM to infer the number of necessary states?

Significance - Justification:
Although the proposed model and inference algorithm are extensions of existing approaches, I believe that it is necessary to provide new and simpler tools that work well in practice to solve important problems, such as segmentation, which appear in a broad range of application domains.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
- In the expression m = S/M on page 4, S and M are not defined.
- In the experimental section, sorting the figures and tables in the same order as they appear in the text would improve the readability of the section.
- How are the ground-truth labels obtained in the user and fruit fly datasets?
- In the case of the User data, are the actions assigned to each data point given as observations y_t? How are these 59 actions labeled in the data? Are they read directly from the users' log files?
- Based on the results in Table 2, I believe that all the methods provide comparable results in terms of both error and predictive LL, with the siHMM-based methods slightly better in terms of error. However, I believe that comparing the different methods in terms of scalability would greatly strengthen the paper. To this end, the authors could simply show a plot of the evolution of the LL with respect to the inference time. To meet the space constraints, the inference description could be further tightened.
- In general, further polishing of the write-up of the experiments will improve the readability of the paper and potentially its impact.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper deals with a generalization of infinite hidden Markov models (iHMMs) suited to segmentation tasks, and argues that it outperforms the vanilla iHMM, both in terms of computational cost and practical results. The paper builds upon a model developed in the PhD thesis of Johnson (2014) and uses SVI techniques developed by Hoffman et al. (2013) and Johnson & Willsky (2014). The proposed methodology for segmentation is twofold: the conditional distribution of the segmentation variable either ignores or takes into account the observations (the so-called Feature-independent model and Feature-based model).
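One hedged reading of the two segmentation mechanisms just mentioned (the symbols rho, w, phi and the logistic link are illustrative assumptions, not necessarily the paper's exact parameterization):

```latex
% Feature-independent: the switch depends only on the current state,
%   s_t | z_t ~ Bernoulli(rho_{z_t}).
% Feature-based: the switch also conditions on the current observation,
%   e.g. through a feature map phi and a logistic link sigma.
\[
  p(s_t = 1 \mid z_t) = \rho_{z_t}
  \qquad \text{vs.} \qquad
  p(s_t = 1 \mid z_t, y_t) = \sigma\!\bigl(w_{z_t}^{\top} \phi(y_t)\bigr)
\]
```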
Clarity - Justification:
The paper is well written. List of minor typos:
- write SVI in plain words on page 2
- in Applications: I would recommend using the same order for the 3 applications in the text of the Introduction, in the subsections of Section 5, and in the rows of Table 2
- check references in the text; most of the time, brackets shouldn't be used
- double-check consistency in the References, such as capital letters

Significance - Justification:
My significance evaluation is only an educated guess, since I do not feel familiar enough with the iHMM literature. However, the applications seem relevant and convincing to me.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
- It could be interesting to comment on the extra computational burden of the Feature-based model w.r.t. the Feature-independent model. Could the authors comment on this? Additionally, in which typical situations do they expect the Feature-based model to perform better? From the sample of applications in Section 5, it seems that the Feature-independent model does better in situations where the segmentation isn't too fine (such as sensors). Could that be a rule?
- 2nd column of Section 5.1: "We choose a threshold of 0.5 for the posterior segmentation probability to identify a time point as a change point." Could the authors provide guidance on the choice of this threshold? Could it be left to some post-processing (see the sketch after these comments)?
- 2nd column of Section 5.1: "however, as shown in the last row of Fig. 1, the model is able to provide an estimate for the uncertainty over change points." I would recommend rephrasing: the model allows for a point estimate of the change-point probability, but not uncertainty over this estimate, or am I missing something? At least, this is what the last row of Fig. 1 displays, contrary to the authors' claim.
Minor comment:
- end of Section 5.4: "It seems that the single observation feature that we are using for the experiment does not help in this dataset": could you comment on which observation feature you used?
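As a minimal illustration of the thresholding step quoted above (the 0.5 cutoff is the paper's stated choice; the min_gap post-processing knob is a hypothetical addition for illustration):

```python
import numpy as np

def change_points(seg_probs, threshold=0.5, min_gap=1):
    """Turn per-time-step posterior segmentation probabilities into a list
    of change-point indices: threshold the probabilities, then merge
    detections closer than min_gap steps as a simple post-processing step
    (the min_gap knob is illustrative, not from the paper).
    """
    candidates = np.flatnonzero(np.asarray(seg_probs) >= threshold)
    points = []
    for t in candidates:
        if not points or t - points[-1] >= min_gap:
            points.append(int(t))
    return points

# Toy usage: probabilities as they might come from a fitted model's posterior.
probs = [0.01, 0.02, 0.9, 0.85, 0.1, 0.03, 0.7, 0.2]
print(change_points(probs, threshold=0.5, min_gap=2))  # -> [2, 6]
```

=====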