Paper ID: 785
Title: ForecastICU: A Prognostic Decision Support System for Timely Prediction of Intensive Care Unit Admission

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents a model for predicting when a patient should be admitted to an ICU from a regular hospital ward. The authors treat this as an optimal stopping time problem. This is an interesting formulation of the problem that recognizes that there is a trade-off between accumulating evidence to support a prediction and the ability to make early predictions. The proposed model is based on learning a sequence of joint Gaussian models over historical data, and then using online updating given real-time data from a target patient.

Clarity - Justification:
The presentation of this paper needs to be significantly improved. There are numerous typos that need to be corrected. More importantly, as the paper is currently structured, the authors do not introduce the main algorithm until the sixth page. It could be presented before much of the material in the theoretical foundations section, for example. The related work section, which is quite long, could also be moved after the model development section and before the experiments. Finally, a number of experimental details are omitted from the paper entirely. In particular, the experimental section describes the baseline approaches in only a few lines. There is no discussion of how features are normalized, whether feature selection is used, how the windows from which feature values are extracted are defined, how the observed stopping time is mapped to labels, or how the regularization and other hyperparameters for the models are chosen.

Significance - Justification:
First, this paper is clearly addressing an important problem. If ICU admission decisions can be made earlier and with better accuracy, there is a potential for significant impact on healthcare.
However, the significance of the proposed approach is less clear. It is not clear that the reported level of performance provides practical utility. Finally, due to the lack of detail regarding the baseline methods, it is not clear that the comparisons are actually fair (for example, the comparisons are meaningless if the hyperparameters for the baseline methods were not appropriately optimized).

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):

* As mentioned above, the authors need to provide the missing details regarding how the baseline methods were applied, to give the reader some confidence that the methods were used in an appropriate way and that the comparisons are fair.

* On lines 624-626, the Q function does not have a subscript. Are these Q functions different from Q_1 and Q_0? How are they related? In general, the notation and concepts in Sections 4.1 and 4.2 could be clearer. It might help to include a diagram showing exactly what data is used when evaluating B_t. Eqs. 5 and 6 are clear, but exactly what data is plugged in is not.

* There is no discussion of the storage and computational complexity of the approach. The full covariance matrix learned in Eq. 6 could be quite large, as it is defined over both information channels and time steps. Computing probabilities under a full-covariance Gaussian then has cubic complexity. A discussion of these points should be added.

* It is not clear how Eq. 9 is solved. More detail is needed here.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper describes a Bayesian modeling approach to the increasingly important problem of detecting early physiologic decompensation in hospitalized patients. The proposed framework assumes a generative framework in which each patient is a stochastic process generating streams of physiologic data (e.g., heart rate, glucose, etc.)
that eventually end for one of two reasons: ICU transfer or discharge. The authors train two generative density models for the data streams, one for each class of patient, as joint Gaussians over all the physiologic variables measured at t time steps after hospitalization (Equations 5-6). Given those estimated distributions, Bayes' rule can then be applied to estimate the probability that a patient belongs to either class (equation in Section 3.3). There is also an estimated alignment of a time series (Equation 7): an arg max over possible stopping times and the resulting shift in timestamps. The alarm is triggered via a pre-defined deterministic threshold on the posterior probability that the patient belongs to the ICU transfer class. In experiments on a modestly sized data set (~1000 patients), the proposed algorithm has a superior tradeoff between precision and recall at a variety of forecasting horizons.

Clarity - Justification:
The paper is somewhat difficult to read -- the details of the Bayesian framework seem unnecessarily abstract (with appeals to measure theory that add little in terms of insight or theory), and there is a proliferation of notation. The model could have been described much more succinctly using, for example, a plate model specifying the generative story (sample a class, then sample a sequence length, then sample measurements at each time step, etc.). Also, the description obfuscates or omits some of the potentially critical details of a true Bayesian treatment, including prior distributions over, e.g., class and model parameters.

The time step notation is very confusing. Earlier in Section 4 (lines 540-550), the reference alignment point (i.e., time step t = 0) is specified as the stopping time of the stochastic process (transfer or discharge). Thus, measurement X(m,n) is the value of the mth stream at n time steps AHEAD of discharge/transfer.
However, just a few paragraphs later (lines 585-595), Q^t is described as the density function of measurements t time steps AFTER the "average" hospitalization time (i.e., admission). That appears to contradict the earlier definition.

Significance - Justification:
This is a timely and pressing problem -- patients who take a turn for the worse are often transferred to the intensive care unit (ICU), the most expensive form of care delivered in modern hospitals. Detecting such patients much earlier in their disease progression can enable early interventions, not only leading to better outcomes but also possibly enabling less costly interventions that make ICU transfers unnecessary. This problem has been the focus of a lot of clinical research, including the development of early warning scores such as NEWS, whose widespread adoption in the UK was mandated by public policy. Clinical solutions have focused on hand-engineered features and regression methods; there seems to be a major opportunity for machine learning researchers to contribute to the development of more robust, learning-based approaches.

It is unclear how this work is situated with respect to previous work on time series modeling and classification. The authors seem to argue (lines 231-244) that their work differs because these time series stop "at some point in time." I assume they mean that the time series' stopping time is not arbitrary but rather due to an external event that is triggered by (or at least can be predicted from) the time series themselves. In other words, a doctor decides to discharge or transfer a patient based on information in the time series. However, this claim appears to ignore related work on, e.g., forecasting mortality in ICU patients, early detection of sepsis, and patient readmission prediction -- all from time series data.
All of these problems have been around for a long time AND have seen a flurry of work within the data mining and machine learning communities in recent years. Here is a representative (but far from exhaustive) list of papers the authors ought to investigate and consider citing:

* M. Ghassemi, et al. A multivariate timeseries modeling approach to severity of illness assessment and forecasting in ICU with sparse, heterogeneous clinical data. AAAI 2015.
* K. Henry, D. Hager, P. Pronovost, and S. Saria. A targeted real-time early warning score (TREWScore) for septic shock. Science Translational Medicine, 7(299):299ra122, 2015.
* I. Stanculescu, C. K. Williams, Y. Freer, et al. Autoregressive hidden Markov models for the early detection of neonatal sepsis. IEEE Journal of Biomedical and Health Informatics, 18(5):1560–1570, 2014.
* K. Caballero and R. Akella. Dynamic estimation of the probability of patient readmission to the ICU using electronic medical records. AMIA Annual Symposium 2015.
* Y. Wang, et al. Mortality prediction in ICUs using a novel time-slicing Cox regression method. AMIA Annual Symposium 2015.

The proposed framework does appear to be distinct from the approaches applied above (HMMs, AR models, Gaussian processes, Cox regression models, etc.), but it is not clear how novel it is. The corresponding graphical model seems quite simple, an elaboration of a Naive Bayes-like architecture: observations at neighboring time steps are conditionally independent given the latent class, stopping time, and observed time step. It appears to be related to (but simpler than) Gaussian process-based classification and regression and dynamic Bayesian networks. How does this framework relate to those approaches, and does it offer any advantages over them? Indeed, it would seem to me that it is actually less statistically efficient than, e.g., an autoregressive model, where you can ignore absolute time by modeling transitions between steps.

Detailed comments.
(Explain the basis for your ratings while providing constructive feedback.):
This is a very interesting and well-written (if somewhat hard to read) paper. Even with the criticisms articulated above, it may be of interest to the ICML community. My initial recommendation is for a weak reject, but I am willing to revise my scores after discussion and author response. In particular, I have the following comments and concerns:

* The role of time is unclear (see comments under "Clarity"). What is the reference time point (stopping time or hospital admission)? How is alignment computed, and how are variable-length sequences handled?

* How does this work compare with and relate to the large volume of work (much of it recent) on similar medical problems (forecasting mortality, predicting sepsis, etc.) using related methods (Gaussian processes, Markov models, Cox proportional hazards models, etc.)?

* How realistic are the stochastic process assumptions (e.g., that measurements at time t are multivariate Gaussian distributed, independent of neighboring time steps)? I think the model gets around SOME stationarity assumptions by assuming a separate conditional distribution for each time step, but it assumes some stationarity in the form of a "constant" latent class -- i.e., that whether a patient will be transferred or discharged is, in a sense, fixed ahead of time. This is pragmatic but clearly unrealistic, at least when one considers the confounding effects of treatments. See C. Paxton, et al., AMIA 2013, and K. Dyagilev and S. Saria, Machine Learning, 2015.

* What are the details of the baselines? What are the input features, and how is "time" captured? Are these "strong" baselines, or does the proposed framework have a substantial advantage in that it is the only approach that explicitly models time?
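To make the assumptions questioned in the bullets above concrete, here is a minimal sketch of the generative story this review attributes to the model: a latent class (discharge vs. ICU transfer) fixed ahead of time, a separate class-conditional Gaussian at each aligned time step, conditional independence across time steps given the class, and an alarm when the Bayes-rule posterior for the ICU class crosses a threshold. This is NOT the authors' implementation; all dimensions, means, priors, and the identity covariance are illustrative inventions.

```python
# Illustrative sketch (not the paper's code) of the reviewed generative story.
import numpy as np

M, T = 3, 5                                         # physiologic streams, time steps
mu = {0: np.zeros((T, M)),                          # discharge class: flat means
      1: np.cumsum(0.5 * np.ones((T, M)), axis=0)}  # ICU class: drifting means
prior = {0: 0.5, 1: 0.5}                            # class priors (assumed uniform)

def gauss_logpdf(x, mean):
    # log N(x; mean, I): identity covariance keeps the sketch simple
    d = x - mean
    return -0.5 * (d @ d) - 0.5 * len(x) * np.log(2 * np.pi)

def posterior_icu(x):
    """P(class = ICU | measurements so far), treating time steps as
    conditionally independent given the class (the Naive-Bayes-like
    structure noted above)."""
    loglik = {c: sum(gauss_logpdf(x[t], mu[c][t]) for t in range(len(x)))
              for c in (0, 1)}
    w = {c: np.exp(loglik[c]) * prior[c] for c in (0, 1)}
    return w[1] / (w[0] + w[1])

def first_alarm(x, thresh=0.9):
    """1-based index of the first time step at which the alarm fires."""
    for t in range(1, len(x) + 1):
        if posterior_icu(x[:t]) > thresh:
            return t
    return None

# On the noiseless ICU-class mean trajectory, the alarm fires partway in,
# illustrating the evidence-accumulation vs. earliness trade-off.
print(first_alarm(mu[1]))
```

A sketch like this also makes the reviewers' complexity concern visible: replacing the identity covariance with a full covariance over all streams and time steps would make each density evaluation cubic in M*T.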
===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper presents a prognostic decision support system to predict when a patient will be admitted to the Intensive Care Unit (ICU) earlier than a clinician would make the decision. The system observes a patient's stream of physiological data and alerts when a threshold of belief is crossed. The paper compares the proposed system to four other algorithms using a dataset from a large academic hospital and demonstrates statistically significant improvement.

Clarity - Justification:
The problem and method are very clearly described. The paper is easy to follow, and it would likely be easy to reproduce the algorithm, though medical datasets always make reproduction of the exact results impossible for privacy reasons.

Significance - Justification:
The application has potential for major impact, and the proposed method is novel. I think the combination of the two puts the significance of this work above average.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I appreciate that the learned policy is subject to a constraint on PPV, rather than including PPV as a term in the value function. PPV is likely the most important metric, as it is unclear to me whether a clinician would be able to learn to distinguish the true positives so much earlier than the standard of care. Constraining the model to achieve much higher PPV would obviously come at a cost in TPR, but that seems to be where the real value of such a model lies: capturing as many positives as possible, as early as possible, at very high precision, and leaving the rest to current practice. Otherwise, it seems hard to say exactly how this approach would compare.

Overall, I think this is a great application and a solid contribution. I really don't have much to complain about and lean toward acceptance.

[Lines 187, 225-226] - "where" should be "were"

=====
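Review #3's preferred PPV-constrained operating point can be sketched concretely: on held-out data, choose the lowest alarm threshold whose precision meets a target, and accept whatever recall remains (recall is non-increasing in the threshold, so the lowest qualifying threshold maximizes TPR). The scores and labels below are synthetic placeholders, not values from the paper, and `pick_threshold` is a hypothetical helper, not part of the proposed system.

```python
# Hedged sketch of choosing an alarm threshold under a hard PPV constraint.
import numpy as np

def pick_threshold(scores, labels, min_ppv=0.9):
    """Return (threshold, ppv, tpr) for the smallest threshold achieving
    PPV >= min_ppv on validation data, or None if no threshold qualifies."""
    n_pos = np.sum(labels == 1)
    for th in sorted(set(scores)):          # ascending: first hit is lowest
        pred = scores >= th
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        if tp + fp == 0:
            continue
        ppv = tp / (tp + fp)
        if ppv >= min_ppv:
            return th, ppv, tp / n_pos      # tp / n_pos is the resulting TPR
    return None

# Synthetic validation scores/outcomes (1 = ICU transfer, 0 = discharge).
scores = np.array([0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2])
labels = np.array([1, 1, 1, 0, 1, 0, 0, 0])
print(pick_threshold(scores, labels, min_ppv=0.75))
```

Tightening `min_ppv` raises the chosen threshold and trades away TPR, which is exactly the trade-off the review argues is acceptable when the remaining cases fall back to current practice.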