We thank the reviewers for their helpful feedback. For any comments that require clarification, appropriate changes to the text will be made. Detailed responses are below.

Reviewer 1:

1. Z changes with time t
The Z node indicates which of the k AR processes generates the data. Our model allows more than one AR process to generate data for a positive (or negative) instance label. We found that this flexibility fits the data better; e.g., an activity like running may be viewed as transitioning between a few poses, with each pose represented by an AR process. Note that the HMM part of the model allows Z to remain in the same state for several time steps if this persistence fits the data better. Another reason for a time-varying Z is that the bags in our data were chosen to contain not only the “positive” activity but also other activities, which act as noise.

2. X^t not dependent on Z^(t-p):(t-1)
This was a modeling choice to simplify the observation process. With this simplification, X^t depends only on Z^t (and not on Z^(t-p):(t-1)) to select the appropriate AR process. Our choice implies that the Z variable can be interpreted as a higher-level abstraction of the raw data, with a simpler transition than the raw observations X, which require the p previous values.

3. Parameter tuning poorly done
We appreciate the reviewer’s suggestion and will include the results of extending the range of C in the final version.

4. 20 random splits makes stdev biased
To obtain bags, the long time series was split into contiguous chunks [1,200], [201,400], ... of the original time series. The randomization for the splits occurs at the bag level, not at the instance level. Note that the samples in each bag remain in their original order throughout the experiments, preserving the temporal structure of the time-varying signals. We acknowledge that there is some temporal autocorrelation between the train, validation and test splits that could bias the confidence intervals.
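To make the dependence structure in responses 1 and 2 concrete, here is a minimal illustrative sketch (not the paper's implementation) of a switching AR generative process: Z follows a simple Markov chain, and at each step Z^t selects which set of AR coefficients produces X^t from the previous p observations. The function name `sample_switching_ar` and all parameter values are hypothetical.

```python
import numpy as np

def sample_switching_ar(T, trans, coeffs, noise_std=0.1, seed=0):
    """Sample from a hypothetical switching AR model.

    trans:  (k, k) Markov transition matrix for the hidden state Z
    coeffs: (k, p) AR coefficients; Z^t selects the row used at step t
    """
    rng = np.random.default_rng(seed)
    k, p = coeffs.shape
    z = np.zeros(T, dtype=int)
    x = np.zeros(T)
    x[:p] = rng.normal(0.0, noise_std, size=p)  # arbitrary initial values
    for t in range(1, T):
        # Z has a simple Markov transition; it may persist in one state
        z[t] = rng.choice(k, p=trans[z[t - 1]])
    for t in range(p, T):
        # X^t depends on Z^t (which AR process) and the p previous observations
        x[t] = coeffs[z[t]] @ x[t - p:t][::-1] + rng.normal(0.0, noise_std)
    return z, x

trans = np.array([[0.95, 0.05], [0.10, 0.90]])   # sticky transitions
coeffs = np.array([[0.5, 0.2], [-0.4, 0.1]])     # k=2 states, AR order p=2
z, x = sample_switching_ar(200, trans, coeffs)
```

Note that the observation at time t conditions on the raw values X^(t-p):(t-1) but only on the current Z^t, matching the simplification described above.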
One possible approach is to perform leave-one-day-out cross-validation, and we will include these results in the final version. We welcome more detailed suggestions or a reference for improving the random splits.

5. Didn’t follow experiment setup from Stikic 2011
We will clarify the phrasing of the experiment setup and include results using the STAT+FFT features from Stikic 2011 in the final version.

6. Not using MIL-based algorithms from Stikic 2011
The Stikic 2011 paper studied MIL and semi-supervised learning for activity recognition. Our work is not in a semi-supervised setting, so the graph-propagation approach in Stikic 2011 does not apply. The MIL approaches in Stikic 2011 were all variants of miSVM. The init-miSVM approach “cheats” by initializing miSVM with the instance labels responsible for the bag-level label. The mc-miSVM algorithm requires multi-labeled bags, which differs from our binary classification setting.

Reviewer 4:

1. MIML for activity recognition
We appreciate the reviewer’s insight and are investigating this direction for a future paper.

Reviewer 5:

1. AR order
We used AR order 2 only for illustrative purposes in our figures. In the experiments, we selected the AR order on a validation set.

2. Message passing complexity
Each recursive step of the forward and backward passes must consider O(K) outcomes for the Z variable and O(T) outcomes for the N variable; each of these O(KT) states sums over the K possible previous values of Z, so one step costs O(K^2 T). Since there are T steps in total, the overall complexity is O(K^2 T^2).

3. Is the DP belief propagation?
Yes, the DP is a special case of belief propagation that arises from our model structure and allows efficient exact inference. We will point this out in the text.

4. Bag definition
We consider a time-series interval as a bag. In the experiments, we cut the raw time-series signal into fixed-length intervals, treat each interval as a bag, and assign a positive/negative label to the bag.
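As an illustrative sketch of where the O(K^2 T^2) bound comes from (assuming, as in our response above, that each forward step enumerates the O(KT) joint states of Z and the counter N and sums over the K previous Z values), the following hypothetical operation-counting loop mirrors the recursion's structure:

```python
def forward_cost(K, T):
    """Count the basic operations in a forward pass over states (Z, N).

    At step t there are K outcomes for Z and up to t+1 outcomes for the
    counter N; each (z, n) state sums over the K possible previous Z
    values, giving O(K^2 * T) work per step and O(K^2 * T^2) overall.
    """
    ops = 0
    for t in range(T):               # T recursion steps
        for z in range(K):           # current Z outcome
            for n in range(t + 1):   # counter N takes up to t+1 values
                for z_prev in range(K):  # sum over previous Z
                    ops += 1
    return ops

# Exact count: K^2 * T * (T + 1) / 2, i.e., Theta(K^2 * T^2).
```

The count K^2 · T(T+1)/2 confirms the quadratic growth in both K and T.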
5. I in the equation around line 381
I is an instantiation of all the instance labels in the bag.

6. Predicting multiple activities
Predicting multiple activities falls under the MIML framework, which we will investigate in the future.

7. Bar chart replacement of the table
We will consider this for the final version.

8. Dataset details
Opportunity: 1040 bags. Trainspotting1: 245 bags. Trainspotting2: 93 bags. Stochastic gradient descent was not necessary on these datasets, but it would be an efficient alternative for larger datasets. We appreciate the suggestion.

9. Bag length justification
For all the datasets, each bag has a fixed size of 200 observations. Choosing the bag length involves a tradeoff between label ambiguity and ease of annotation. The longer the bag, the fewer labels the human annotator needs to provide; however, the longer the bag, the more ambiguity as to which instances in the bag contribute to the bag-level label.
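The contiguous-chunk bagging described in responses 4 (Reviewer 1) and 9 (Reviewer 5) can be sketched as follows; `make_bags` is a hypothetical helper, not code from the paper:

```python
import numpy as np

def make_bags(series, bag_len=200):
    """Split a long time series into contiguous fixed-length bags.

    Instances keep their original temporal order inside each bag; any
    trailing remainder shorter than bag_len is dropped.
    """
    n_bags = len(series) // bag_len
    return [series[i * bag_len:(i + 1) * bag_len] for i in range(n_bags)]

# Chunks [0:200], [200:400], ... of the original series become bags;
# randomization for train/validation/test splits then happens at the
# bag level, not the instance level.
bags = make_bags(np.arange(1000), bag_len=200)
```

Each bag would then receive a single positive/negative label, while the instances inside retain their original order.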