We would like to thank the reviewers for their detailed reviews and many useful comments. We feel that the following summary from Assigned_Reviewer_1 accurately reflects the main contributions of the work.
> This paper describes a model for labeling and segmenting activities that consist of intermittent events, such as eating and smoking. The proposed CRF performs labeling and segmentation jointly, and includes features at the level of individual events, spans of time in between events, and counts of the number of positive events in an entire activity. This model is carefully structured so that MAP and marginal inference remain tractable, even though the features are fairly rich. Empirically, the method performs well on synthetic and real datasets, compared to good baselines.

We would like to respond to the following points from reviewers 6 and 3:

Assigned_Reviewer_6
==================

> The particular advantages of the proposed method are not incredibly clear. It appears to be more useful at segmenting activity than at detecting events, when compared to studied alternatives.

The reviewer is correct to point out that the larger performance improvements are on the segmentation task, where we show significant improvements relative to a robust baseline. However, we also obtain better mean F1 at the event detection level on 3 of 4 datasets. The synthetic data results show that there is indeed a synergy between the two tasks in the right regime (somewhat noisy features and some conserved structural information). See Figure 3 for details.

> It would benefit from more comprehensive comparisons against established methods of event detection (all sorts of control charts come to mind) and against various methods of segmentation, including time series clustering.

Control charts are an unsupervised anomaly detection method. They identify when a process deviates from control limits computed from basic process statistics (the mean and standard deviation of the time series). To the best of our knowledge, this methodology is not applicable to supervised problems where the goal is to learn to detect events optimally given event labels for training, or to problems where the goal is to learn to segment optimally given labeled segmentations for training. Time series clustering is likewise purely unsupervised and is not applicable to the supervised segmentation problem for the same reason. By contrast, the CRF-based framework we consider in this paper is the standard model family for supervised labeling and segmentation in natural language processing, computer vision, and many other areas.
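To make the distinction concrete, a minimal Shewhart-style control chart is sketched below in Python. The function name `control_chart_flags`, the multiplier k = 3, and the use of an initial in-control window to set the limits are illustrative assumptions, not part of our method or of any specific baseline. The point is only that the limits are computed from summary statistics of the signal itself; no event or segment labels enter the procedure, so there is nothing for it to learn from labeled training data.

```python
import statistics


def control_chart_flags(series, baseline_len, k=3.0):
    """Flag points outside mean +/- k*std of an initial baseline window.

    Hypothetical illustration: the limits depend only on summary
    statistics of the (assumed in-control) baseline window. No event
    labels are used anywhere, which is why the method is unsupervised.
    """
    baseline = series[:baseline_len]
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)  # sample standard deviation
    lower, upper = mu - k * sigma, mu + k * sigma
    # True marks an out-of-control point (a detected "anomaly").
    return [not (lower <= x <= upper) for x in series]


signal = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.1, 5.0, 1.0, -3.0]
flags = control_chart_flags(signal, baseline_len=7)
print([i for i, f in enumerate(flags) if f])  # prints [7, 9]
```

Adapting such a detector to a labeled task would require choosing the limits (or k) by some external tuning procedure; the chart itself has no mechanism for optimizing detection or segmentation accuracy against training labels.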

Assigned_Reviewer_3
==================

> 1) Whats the point of the nonlinear transformation of the features? Is there some property of the sensor that should be modeled?

The non-linearities add model capacity that, empirically, improves prediction results across all models. In cases where there is enough knowledge to model the sensor, this knowledge may be encoded in the features, in which case the non-linear transformations may be unnecessary. The RQS dataset is an example where the base features were carefully engineered and paired with sophisticated discretization techniques (see the original article for details).

> 2) Do the significance results correct for the multiple tests ran or are they intended to simple point out which results are worthwhile to pay attention to?

No multiple-comparisons corrections were applied to the significance tests; however, in the segmentation case we compare only two models, so no correction is needed. The reviewer is correct that the significance results are, for the most part, redundant with the performance plots, with the exception of the t-test pooled across all datasets, which shows that performance is significantly better overall.

> 3) It would be nice to replicate Thomaz et al. random forest result for event detection. It seems as simple as the LR implementation and might make it more clear what one should use in practice.

We agree and are currently running exactly this experiment. The results will be included in the final version of the paper. Our current event detection results are very close to those reported in Thomaz et al., although the experimental protocols differ somewhat in how the data are partitioned. Since Thomaz et al. do not address the higher-level session delineation problem, these results will not affect the significance of the segmentation results.

> 4) What features are important?

Our focus in this work is on the effect of modeling different types of higher-level structure while keeping the event-level feature sets fixed. At the structural level, the T-CRF and HNS models differ in their structural factors (i.e., pairwise vs. higher-order). The performance improvements in segmentation suggest that the higher-level factors we have proposed, and their associated features, are important.