We thank all reviewers for carefully reviewing our paper.

The key contribution of PSIM is the combination of Inference Machines and PSRs. Even when learning algorithms are consistent (e.g., spectral methods with the correct choice of rank), inevitable small imperfections in the learned models can result in compounding error when filtering. The compounding error arises because the filtered states wander away from the training distribution under recursive use of the learned model. Inference machines correct for this error: PSIM with DAgger iteratively gathers additional training data, teaching the learned PSR how to recover from filtering errors on real data (a minimal sketch of this loop is given at the end of this response). In other words, PSIM directly attacks the problems inherent in using the learned model in practice, and it is the first PSR-type learning algorithm to guarantee filtering performance. Previous algorithms (e.g., N4SID and IVR) attempt to learn accurate parameters, but do not consider the compounding effect that errors in parameter estimation have in practice. We will update the final manuscript to clarify this contribution.

NOTE: Rx denotes a remark from reviewer x and A denotes our response.

R1: Regarding finding a good F from {F_i}
A: We return F by evaluating {F_i} on a validation set. One could instead return the average of {F_i}, for which the theoretical bounds also hold.

R1: Details on N4SID & IVR
A: Reduced-rank ridge regression is used for both N4SID and IVR, with the rank chosen on a validation set. N4SID is implemented following the pseudocode on pg. 121 of the book.

R1: 1. IVR & N4SID on a controlled synthetic task
A: The reviewer is correct; we should have included IVR & N4SID in the synthetic experiment. We did observe that the performance of IVR & N4SID on synthetic data improves as the amount of data increases, but PSIM with DAgger still converges faster. We will include this result in Fig. 2.

R1: Regarding 2a
A: Previous methods do not address the compounding effect that errors in parameter estimation (due to model mismatch or limited data) have on filtering performance, which is the main contribution of PSIM.

R1: Regarding 2b
A: The IVR framework allows nonlinear regression ONLY in the first stage. With phi_1, IVR with RFF (using RFF in the first stage) did not work well on our datasets. This is because IVR is restricted to linear regression in the final step (stage 2); otherwise the estimates are biased. Hence IVR with RFF in stage 1 is still limited to linear models. It is unclear how to integrate nonlinear regression into N4SID: naively mapping observations to RFFs and then applying N4SID performed extremely poorly, and a proper kernelized extension of N4SID is beyond the scope of this work.

R1: Regarding 3 and 4
A: Comparing against kernel PLGs is an interesting idea, and we will work to include them in the final manuscript. Kernel PLGs have greater representational capacity than linear models and we expect them to perform better than, e.g., IVR without RFFs. However, the learning algorithm for kernel PLGs does not address the compounding effect that errors in parameter estimation have on filtering performance. Interestingly, N4SID is directly related to PLGs and spectral learning; in fact, N4SID can be viewed as a spectral learning algorithm for both LDSs *and* PLGs. There is a close connection: N4SID represents the state of a system as the sufficient statistics of a multivariate Gaussian distribution over future observations and learns the set of parameters required to update this state. Assuming the training data was generated by an LDS, the parameters recovered by N4SID are shown to be a similarity transform of the true parameters in the limit of infinite training data. We will add a discussion of N4SID and PLGs to the final manuscript.
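To make this connection concrete (the notation here is ours and purely illustrative): for a linear-Gaussian system, the distribution of the next k observations given the history h_t is Gaussian, so its sufficient statistics can serve as the predictive state,

    q_t = ( E[ o_{t:t+k-1} | h_t ],  Cov[ o_{t:t+k-1} | h_t ] ),

and both N4SID and PLGs track such statistics, learning the parameters of the update that maps q_t and the incoming observation o_t to q_{t+1}.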
R2: Autoregression & standard regression fitting
A: For latent state space models, the belief state depends on the WHOLE history, so regressing from the whole history to the future is unrealistic; AR models often require a long history window in practice (Sec. 5.1). A stronger regressor could improve performance, but in theory an AR model would still require the whole history (or a long history in practice). Predictive states do more than predict observations: they directly represent belief states, which depend on the whole history, as a distribution over future observations. [1] shows that using the predictive state from N4SID as a contextual feature is useful for structured prediction.

R2 & R3: RNNs
A: We observed that PSIM+RFF performs better than a simple RNN (SRN): by 62% on Beach Video, 43% on Motion Capture, and 9% on Robot Drill. The RNN may get stuck in local minima due to the long trajectories (~200 steps). PSIM's predictive states have statistical meaning, whereas an RNN's hidden states do not. [1] shows that models from spectral methods can be used to initialize RNNs; we are exploring this possibility for PSIM.

[1] Belanger et al., A Linear Dynamical System Model for Text, ICML 2015.
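As referenced in the general response above, here is a minimal sketch of the DAgger-style training loop behind PSIM. The featurization (stacking the next k raw observations), the plain ridge-regression learner, and all function names are simplifying assumptions for illustration only, not the exact implementation in the paper:

```python
import numpy as np

def featurize(future_obs):
    # Predictive state features: here simply the next k observations, stacked into one vector.
    return np.asarray(future_obs).ravel()

def ridge_regression(D, lam=1e-3):
    # Fit a linear map (with bias term) from inputs to targets over the aggregated data D.
    X = np.array([np.append(x, 1.0) for x, _ in D])
    Y = np.array([y for _, y in D])
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y).T

def train_psim(trajectories, k=5, n_iters=5):
    # trajectories: list of arrays of shape (T, obs_dim).
    D = []      # aggregated (input, target) pairs -- data aggregation as in DAgger
    F = None    # current filter: maps (predictive state, observation) -> next predictive state
    for _ in range(n_iters):
        for traj in trajectories:
            q = featurize(traj[0:k])                       # initial predictive state
            for t in range(len(traj) - k):
                target = featurize(traj[t + 1:t + 1 + k])  # true statistics of the future
                D.append((np.concatenate([q, traj[t]]), target))
                if F is None:
                    q = target                             # first pass: roll along the true states
                else:
                    # later passes: roll along the learned filter, so training pairs are
                    # collected on the state distribution the filter itself induces
                    q = F @ np.append(np.concatenate([q, traj[t]]), 1.0)
        F = ridge_regression(D)  # refit on all (state, observation) pairs aggregated so far
    return F
```

At test time, filtering simply rolls the learned F forward exactly as in the inner loop, and predictions of future observations are read directly off the predictive state q. Because later iterations gather training pairs along the state distribution induced by the learned filter itself, the regression is explicitly trained to recover from its own filtering errors.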