Paper ID: 558
Title: Learning to Filter with Predictive State Inference Machines

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

The paper addresses the problem of modeling a discrete-time dynamical system by maintaining a predictive statistic. The authors consider learning a single function (either one per time step, or one shared across all time steps) that updates the predictive statistic of the future given the most recent observation vector. Two algorithms, one for the time-dependent setting and one for the time-independent setting, address this problem from a general supervised learning perspective. Theoretical results are provided for each case, with a more complete and clear picture for the time-dependent case. Synthetic experiments compare these algorithms to AR models, and several experiments with real data highlight the effectiveness of the proposed algorithms relative to some spectral-learning-based techniques.
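To fix ideas, the object being learned is, as I understand it, roughly the following (a minimal sketch of my own, not the authors' code; the function names and the hand-written linear toy F below are mine -- in the paper F is learned):

import numpy as np

def filter_with_learned_update(F, m0, observations):
    """Roll a learned update F over an observation sequence.

    F    : callable (m_t, x_t) -> m_{t+1}, the learned predictor
    m0   : initial predictive statistic (e.g., features of the first k observations)
    observations : iterable of observation vectors x_1, x_2, ...
    """
    m = m0
    trajectory = [m0]
    for x in observations:
        m = F(m, x)           # one supervised-learning step replaces the
        trajectory.append(m)  # usual latent-state predict/update pair
    return trajectory

# Toy usage with a hand-written linear F as a stand-in for the learned one.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = 0.9 * np.eye(2)
    B = 0.1 * np.ones((2, 2))
    F = lambda m, x: A @ m + B @ x
    xs = rng.normal(size=(50, 2))
    ms = filter_with_learned_update(F, np.zeros(2), xs)
    print(len(ms), ms[-1])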
Clarity - Justification:

The main idea presented in the paper is simple and well articulated, and the synthetic experiments are described in enough detail that I believe they could actually be replicated. I also like that the authors worked out the stationary Kalman filter case explicitly in the supplementary material. However, some other parts of the paper need improvement. The theoretical results of Section 4.3 (the section I find most interesting) are not as easy to follow. The statements of Corollaries 4.4 and 4.5 give a bound on the existence of a "good" F among F_1, ..., F_n, but the algorithm is not actually guaranteed to return that F, so a step is missing. There should be a clear corollary/theorem that states something like "The output of Alg. 2 is ...". Also, the paragraph before Corollary 4.4 is too dense and should be written more clearly. For the real-data experiments, some details about N4SID and IVR are not given explicitly. IVR, for example, can also employ various regression types; was ridge regression used there as well? As for N4SID, is it similar to what Boots et al. describe in "Closing the loop..."? The book is cited, but I would expect a more specific citation, or more details on how it was implemented. Finally, although the paper talks about inference (filtering), the method can be used for prediction as well; so it is, after all, a generative model, correct? It would be useful if the authors clarified this explicitly. I did not check the proofs, as they are in the supplementary material.

Significance - Justification:

Following recent work on modeling dynamical systems with PSRs, a natural next step is to try to learn a direct relationship between consecutive predictive states. The work of Hefny et al. '15 hints in that direction, and Wingate et al. '07 ("On discovery and learning of models with predictive representations of state for agents with continuous actions and observations") actually does it, though using different tools. So from this perspective, the paper investigates an important question. The conceptual idea and the algorithms provided, as well as some of the theoretical results, are interesting and meaningful. However, I have some reservations about the experimental results and the overall conclusion of the paper.

1. It would have been interesting to see how IVR and N4SID actually perform in a controlled synthetic task, but no results were reported. Why?

2a. The results on real data look somewhat surprising to me. If I understand correctly, PSIM-Linear coupled with the \phi_1 feature set is essentially a k-dimensional LDS in the original space. Why does it do so well on such complex tasks compared to all other methods? I think that demands extra analysis/explanation. I find it hard to believe that the issues with spectral learning outlined in Kulesza et al. are the reason the other methods do not perform well on these tasks, as the paper seems to suggest. For example, I would expect spectral methods to perform at least comparably, or even better, when more data is given, but I would love to see this analysis (at least on a synthetic data set).

2b. I can see why PSIM with Fourier features works well, and my next question would be: what happens to N4SID and IVR if Fourier features are used?

3. Related to (2), there has been work on building "PSRs", called PLGs, specifically for Gaussian systems (Kalman filter, AR, etc.) and for nonlinear systems through kernelization. I am referring to Rudary et al. 2007, "Predictive linear-Gaussian models of stochastic dynamical systems" (and its kernelized version, Wingate et al. '06, "Kernel Predictive Linear Gaussian..."). Even though the approaches are different, I believe it is important to compare against those methods, especially when the features are the observations themselves, i.e., when PSIM-Linear learns a stationary Kalman filter, just like PLG.

4. Yet another related work that I think is worth mentioning/comparing with: Rodu et al., "Using Regression for Spectral Estimation of HMMs", 2013. They propose learning HMM parameters using various types of regression, which can be more data efficient.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

All the comments are already given in the previous sections. To summarize, the paper investigates an interesting and useful avenue for modeling dynamical systems based on their predictive statistics and proposes a few solutions. Currently, it leaves me uncertain as to why the proposed method actually works well, or when it can be expected to work well. I would be willing to revise my recommendation if the authors provide compelling feedback.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

This paper proposes a new approach to time-series filtering, building on ideas from predictive state representations (PSRs) and inference machines. Like PSRs and spectral methods such as 4SID, the proposed approach aims to avoid EM for approximate maximum likelihood, but it additionally avoids a separate model-fitting step entirely. Instead, the filtering algorithm is learned directly in a single step, essentially leveraging potentially nonlinear regression models to predict a notion of predictive state. This approach allows for some asymptotic theoretical guarantees on filtering performance, for both stationary and nonstationary filters. The paper also gives experimental results showing the promise of the proposed approach.
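Concretely, my understanding of the single-step learning is something like the sketch below (my own illustration under simplifying assumptions that are mine, not the paper's: the predictive state is taken to be the next-k observation window, the regressor is ridge regression, and where the paper's Algorithm 2 aggregates data across DAgger-style iterations, this sketch simply refits from scratch on each pass):

import numpy as np
from sklearn.linear_model import Ridge

def training_pairs(trajs, F, k, m0):
    """Roll the current filter F along each trajectory to generate inputs
    (m_t, x_t) paired with observed-future targets (x_{t+1}, ..., x_{t+k});
    the sampled future serves as an unbiased target for the predictive state."""
    X, Y = [], []
    for xs in trajs:                       # xs: (T, d) array of observations
        m = m0
        for t in range(len(xs) - k):
            X.append(np.concatenate([m, xs[t]]))
            Y.append(xs[t + 1 : t + 1 + k].ravel())
            m = F(m, xs[t])                # advance with the current filter
    return np.array(X), np.array(Y)

def fit_update(trajs, k, d, n_iters=5, alpha=1.0):
    """k: length of the future window; d: observation dimension."""
    m0 = np.zeros(k * d)
    F = lambda m, x: m                     # trivial initial filter to bootstrap
    for _ in range(n_iters):               # refit on states induced by current F
        X, Y = training_pairs(trajs, F, k, m0)
        reg = Ridge(alpha=alpha).fit(X, Y)
        F = lambda m, x, reg=reg: reg.predict(
            np.concatenate([m, x])[None, :])[0]
    return F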
Clarity - Justification:

The paper is very clearly written despite containing a great deal of mathematical and algorithmic content.

Significance - Justification:

While this is the first time I have seen the inference machine, the contribution of extending such ideas to the (nominally) unsupervised task of learning state space models is clear. The novelty relative to PSRs is also very clearly summarized in lines 324-338. See detailed comments.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

In some sense the problem being solved is not really unsupervised: as discussed near line 354, once the choice is made to work with the predictive states (line 258, or the featurized version on line 289), the task is essentially a supervised one. I therefore wonder to what extent these ideas can be reframed as simply fitting a nonlinear autoregression with a particular structure (using a particular algorithm). If this problem is essentially one of fitting a kind of autoregression, predicting future observations in terms of past ones, then are the learning-theory results simply restatements of known results? Can you highlight the ways in which this task differs from standard regression fitting (but in the time-series context)?

On the experimental side, the authors compare to autoregressive models and to LDS models fit with N4SID, in terms of both prediction performance and statistical efficiency, but as far as I can tell they do not compare to LDS models fit with EM. The paper would be stronger if EM-fit LDSs were included.

Finally, to the extent that this approach essentially fits nonlinear autoregressions for prediction, it would be nice to see some discussion of, or comparison to, recurrent neural networks (RNNs).

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):

This paper describes a method for learning to perform filtering for a dynamical system without the need to estimate its latent state. It does this by combining ideas from inference machines with ideas from predictive state representations. Inference machines are supervised learning approaches to structured prediction that directly optimize inference performance in a graphical model, instead of first parametrizing potential functions and then estimating their parameters. Predictive state representations (PSRs) pose the task of learning dynamical systems as learning sufficient features of future observations, instead of learning a latent state representation and estimating the evolution of the latent state. PSRs perform filtering in two steps: linear prediction of the next extended state, followed by conditioning on the current observation. This paper puts PSRs in the inference machine framework, allowing these two steps to be performed together by a potentially non-linear predictor.
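Schematically, the contrast is the following (my own sketch, not taken from the paper; W and C are illustrative operators I introduce here, and writing the conditioning step as a single linear map is only faithful in the linear-Gaussian case):

import numpy as np

def psr_filter_step(q, x, W, C):
    # Step 1: linearly predict the "extended" predictive state.
    p = W @ q
    # Step 2: condition on the new observation x. For general PSRs this is a
    # multiplicative Bayes-rule update; a linear map on [p; x] as used here is
    # only accurate for the linear-Gaussian (stationary Kalman filter) case.
    return C @ np.concatenate([p, x])

def psim_filter_step(q, x, F):
    # The inference-machine move: one learned, possibly nonlinear map that
    # absorbs both the prediction step and the conditioning step.
    return F(q, x)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    q, x = rng.normal(size=4), rng.normal(size=2)
    W, C = rng.normal(size=(6, 4)), rng.normal(size=(4, 8))
    print(psr_filter_step(q, x, W, C))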
Clarity - Justification:

The paper is very well written and easy to follow. Claims are well supported by evidence and citations. The literature review is well targeted. Figure 1 is very helpful in understanding the notation. One issue with clarity is that, due to space limitations, certain information (proofs and examples) had to be put in the supplementary material; it is not necessary for understanding the paper, but it would be quite helpful. Perhaps proof sketches of a sentence or two each could be added to the main body of the paper. In addition, the last two paragraphs of Section 5 describe additional experimental results that would be better presented in a table or graph.

Significance - Justification:

The paper builds on several existing approaches, but in synthesizing them it takes a significant step forward. The approach strikes a nice balance between interpretability and theoretical guarantees on the one hand, and performance on the task at hand (filtering) on the other.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

While the paper is already tight on space, it would have been informative to compare the results of the system to a more complex, non-linear model than auto-regressive models for directly predicting the next observation from past observations. The one that springs to mind is a recurrent neural network.
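For concreteness, the kind of baseline I have in mind is sketched below (my own code, not from the paper; the GRU architecture and all hyperparameters are arbitrary illustrative choices):

import torch
import torch.nn as nn

class GRUPredictor(nn.Module):
    """One-step-ahead observation predictor with an opaque recurrent state,
    playing the same role as the learned filter but without predictive states."""
    def __init__(self, obs_dim, hidden=32):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, obs_dim)

    def forward(self, x):            # x: (batch, T, obs_dim)
        h, _ = self.rnn(x)
        return self.head(h)          # prediction of x_{t+1} at each step t

def train(model, xs, epochs=50, lr=1e-2):
    """xs: (batch, T, obs_dim) tensor of observation sequences."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        pred = model(xs[:, :-1])         # inputs x_1 .. x_{T-1}
        loss = loss_fn(pred, xs[:, 1:])  # targets x_2 .. x_T
        loss.backward()
        opt.step()
    return model

=====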