First of all, we would like to thank the reviewers for their detailed and accurate comments. They definitely point out some shortcomings of the paper, which we will do our best to rectify.

== Reliability ==

As far as we know, the first definition of reliability in this context of early detection was given by Parrish et al.: “we use the term reliability to mean the probability that the class label assigned to partial sequence matches that assigned to the full sequence”. This definition does not focus on the type of errors but only on whether decisions match: if a false alarm is obtained on a full sequence, then a reliable detector would also provide a false alarm on the partial sequences, and the same is expected for missed detections. Hence, we want to stress that reliability is not related to AUC. Our Definition 3.1 matches Parrish et al.'s definition in a deterministic way: the equation in this definition says that if the detector does not trigger on the full sequence, then it should not trigger on any of the partial sequences. If the detector triggers on the full sequence, note that it will surely trigger on at least one partial sequence (which can be the full one). We thus believe that Definition 3.1 still guarantees our claim in line 104.

== Numerical experiments ==

* The chirps are two linear chirps, both starting from 100 Hz and reaching 7000 Hz and 8000 Hz respectively. They are thus easier to discriminate at the end of the sequences. We apologize for omitting this information. We will provide more details and plot some of the time series (in supplementary material if we lack space).
* Parameters for the toy and BCI problems have been kept the same, so as to show that they were not tweaked for each problem.
* For the emotions dataset, the parameters $C$ and $\lambda$ have been cross-validated, and $\mu$ has been set to $2$.

== Landmarks and earliness ==

In the general case, the dimension of $w$ is equal to the number of landmarks, denoted $m$. In our setting, as explained in lines 373-376, we do not use predefined landmarks; we let the algorithm select them by means of a weighted $\ell_1$ penalty, where the weights encourage the algorithm to select early landmarks. The selection of landmarks occurs over the set of all possible frames, which leads to $m = nT$ ($n$ being the number of examples and $T$ the number of frames in each example). The vector $\mu$ controls how strongly late landmarks are penalized: the larger its final weights, the more late-appearing frames are penalized. It is an open question whether the use of $\mu$ can be avoided without training on partial sequences (as Hoai et al. do), i.e., whether earliness can be promoted with neither a penalty nor partial-sequence training.
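For concreteness, a minimal sketch of the kind of weighted $\ell_1$ penalty described above is the following, where the notation $t(j)$ for the frame index associated with landmark $j$ is introduced here only for illustration and may differ from the paper's exact formulation:

$$ \Omega_{\mu}(w) \;=\; \sum_{j=1}^{m} \mu_{t(j)}\,|w_j|, \qquad \mu_1 \le \mu_2 \le \dots \le \mu_T, $$

so that landmarks taken from late frames receive larger weights and are driven more strongly towards zero, while early frames remain cheap to select.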
== Comparison with MMED ==

Our model is indeed comparable to MMED on performance measures based on AUC and earliness (AUAMOC), although on the toy and emotion problems we obtain a better normalized time to detect at small false-positive rates. Hence, our contribution compared to MMED is to guarantee reliable early detection while being significantly faster to train on all experimental problems. We indeed omitted the training times for all datasets and will add them; in practice, the minimal gain in running time we achieve is around a factor of 10.

== Direct optimization of the criterion ==

We pursue three objectives: high AUC, earliness, and reliability. These objectives are conflicting, hence we are dealing with a multi-objective problem. It is indeed an interesting open problem to formulate a direct (multi-objective) optimization problem that jointly optimizes all of them.

== Other replies ==

* A model based on a recurrent structure can indeed be of interest; however, in some situations detection can occur based on a single, short temporal event.
* Thanks for pointing out the work of Fawcett and Provost; we will discuss it in the final version of the paper.
* Section 3.4: the result of Kakade et al. can be applied fairly easily since we are considering linear models in a landmarking space; $k(\cdot,\cdot)$ being a PSD kernel is not a necessary condition (a generic statement of the kind of bound we have in mind is sketched after this list).
* We will correct all the typos and carefully proofread the final version.
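Regarding the Section 3.4 point, the flavor of result we rely on is the standard Rademacher-complexity bound for linear classes; as an illustrative statement (the exact norms and constants used in the paper may differ), for $F = \{x \mapsto \langle w, x\rangle : \|w\|_1 \le W_1\}$ with features in $\mathbb{R}^d$ satisfying $\|x\|_\infty \le X_\infty$,

$$ \mathfrak{R}_n(F) \;\le\; X_\infty W_1 \sqrt{\frac{2\log(2d)}{n}}, $$

which only requires the model to be linear in the landmark representation and does not require $k(\cdot,\cdot)$ to be a PSD kernel.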