We thank the reviewers for their careful reading and their very relevant remarks. We first answer the questions raised by each reviewer. We will modify the final paper according to the comments and our answers.

To Reviewer 1:
- The algorithm is proper in the limit if the dimension of the model is well chosen. It would be straightforward to modify the algorithm performing the NMF (SPA) so that it stops once all the conditional distributions are contained in the identified convex hull (a sketch of such a stopping rule is appended at the end of this response). This distinction is worth making.
- The dependency on t in Theorem 4 is t^4, and the PRFA has to be minimal.
- The BPC equals the average of log P(x_{t+1} | x_{1:t}), where x_t denotes the t-th character.
- About the experiments: 1) Negative values output by the Spectral algorithm are set to zero, and the values are then normalized to obtain conditional probabilities (a sketch of this normalization is also appended at the end of this response). We also tried taking the absolute value before normalizing; performance was similar. 2) NNSpectral performs better than CH-PRFA on PDFA.

To Reviewer 3:
- Some algorithms have been designed to learn HMMs/PFA. However, they need an additional step to recover stochastic matrices (by setting negative values to zero and normalizing, or by projecting onto the simplex). In Mossel and Roch, the errors introduced by this step are analyzed and do not invalidate the PAC-style guarantee (this reference will be added). To our knowledge, this step has not been analyzed for Tensor, but we expect that PAC-style guarantees would still hold. Even if PAC guarantees hold after this additional step, it has been experimentally shown to degrade performance considerably. This is illustrated in "Learning Latent Variable Models by Improving Spectral Solutions with Exterior Point Method" by Shaban et al. and "Methods of Moments for Learning Stochastic Languages: Unified Presentation and Empirical Comparison" by Balle et al. Moreover, these algorithms rely on a non-degeneracy condition (separation of eigenvalues, or full column rank of the transition and observation matrices). Hence, the number of hidden states has to be less than the alphabet size. CH-PRFA avoids these limitations.
- In contrast to algorithms that learn HMMs/PNFA, CH-PRFA returns a PNFA without any additional step, by embedding the constraints in the regression step. This leads to better performance and better initializations for EM.
- In the final version, we will make an effort to state the relation between PRFA and near-separable NMF more clearly in the introduction.
- CH-PRFA is inspired by the Spectral algorithm, but we would like to draw your attention to the fact that the weights are recovered by a quadratic minimization under linear constraints, which makes the error analysis more involved.

To Reviewer 4:
- As answered to Reviewer 3, in the final version we will make an effort to give more intuition in the introduction about the link between PRFA and near-separable NMF.

To all reviewers: we would like to mention some points that make this contribution valuable and open many directions for future research.
- This work makes use of PRFA, models that have been largely overlooked. PRFA lie between HMMs, which seem hard to learn in the general case, and PDFA, which are PAC-learnable but fully observable. In some sense, PRFA can handle some partial observability while remaining PAC-learnable.
- Experimentally, it seems that many PNFA can be learned by CH-PRFA. On PAutomaC, CH-PRFA performed equally well on all kinds of models (HMMs, PNFA, PDFA).
- CH-PRFA is a very fast algorithm that leverages the particular properties of PRFA to obtain PAC-style guarantees, but it also performs well when learning more complex models.
- For practical applications, it seems that many problems can be modeled by PRFA. PRFA are well suited to systems that contain a PDFA reaching all hidden states. In particular, to solve planning problems under partial observability, we could define an equivalent of PRFA for controlled processes. For example, this could be used in spoken dialogue systems, where, without the uncertainties introduced by speech recognition and natural language understanding, the problem could be modeled by a controlled version of a PDFA.
- This paper promotes additional research on algorithms for near-separable NMF, which has been a growing field during the last decade and an appealing alternative to the SVD. Near-separable NMF looks for extreme estimated vectors, whereas the SVD looks for principal axes of variation. Sometimes, representing estimated vectors on the convex hull formed by the extreme vectors is more meaningful than on the principal axes of variation. This paper highlights some issues in near-separable NMF when the noise is not identically distributed across matrix rows (as for topic discovery in documents of varying length). A goal for future research is to find a trade-off between selecting extreme rows and accurately estimated rows.
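
Appendix to this response, sketch 1 (Reviewer 1, stopping rule). The following is a minimal, non-authoritative sketch of how SPA could be stopped once every estimated conditional distribution lies in the convex hull of the selected anchors; the function names (spa_until_hull, in_hull), the NNLS-based hull test with a penalized sum-to-one row, and the tolerances are illustrative assumptions, not the exact procedure of the paper.

import numpy as np
from scipy.optimize import nnls

def in_hull(W, m, rho=1e3, tol=1e-6):
    # Approximate test: is m in the convex hull of the columns of W?
    # Solve min_{w >= 0} ||[W; rho*1^T] w - [m; rho]||, which softly enforces sum(w) = 1.
    A = np.vstack([W, rho * np.ones((1, W.shape[1]))])
    b = np.concatenate([m, [rho]])
    w, _ = nnls(A, b)
    return np.linalg.norm(W @ w - m) <= tol

def spa_until_hull(M, r_max, tol=1e-6):
    # M: columns are the estimated conditional distributions.
    # Standard SPA selection, but stop as soon as every column of M is
    # (approximately) contained in the convex hull of the selected columns.
    R = M.astype(float).copy()
    anchors = []
    for _ in range(r_max):
        j = int(np.argmax((R ** 2).sum(axis=0)))   # column with largest residual norm
        anchors.append(j)
        u = R[:, j] / np.linalg.norm(R[:, j])
        R = R - np.outer(u, u @ R)                 # project out the selected direction
        W = M[:, anchors]
        if all(in_hull(W, M[:, i], tol=tol) for i in range(M.shape[1])):
            break                                  # all conditional distributions are covered
    return anchors

With eps-accurate empirical estimates, tol would be tied to the estimation error; the point is only that the selection step admits a natural data-driven stopping rule instead of a fixed number of anchors.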
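Appendix to this response, sketch 2 (Reviewer 1, experiments point 1). A minimal illustration, with assumed names, of how the negative outputs of the Spectral baseline are turned into conditional probabilities (set negatives to zero, then renormalize); the uniform fallback for the all-zero case is an implementation detail we assume here, not something stated in the paper. The absolute-value variant mentioned above would replace the clipping with np.abs(scores).

import numpy as np

def to_conditional_probs(scores):
    # scores: possibly-negative outputs of the Spectral algorithm for one context.
    p = np.clip(np.asarray(scores, dtype=float), 0.0, None)  # set negative values to zero
    s = p.sum()
    if s == 0.0:                                  # all mass clipped away: fall back to uniform
        return np.full_like(p, 1.0 / p.size)
    return p / s                                  # renormalize to a conditional distribution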