Paper ID: 1210
Title: Robust Random Cut Forest Based Anomaly Detection on Streams

===== Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The paper concerns anomaly detection for stream data. The starting point is the isolation forest method, which has been shown to outperform several alternatives in an extensive set of evaluations. The paper first exposes the failure of this method in the presence of irrelevant dimensions. Theoretical insight is provided that explains this failure and also why the algorithm is successful in lower-dimensional settings. This guides a small modification that is introduced to address the weakness, together with additional theoretical results justifying that the approach is well suited to dynamic data streams and that it enables adaptive setting of the sampling size. Some principled concepts, such as collusive displacement, are introduced to better characterize anomalies. The resulting algorithms are validated on real datasets, showing improved accuracy.

Clarity - Justification:
The paper is very well written and the concepts are intuitively and rigorously presented. It is very interesting to see how the reasoning progresses, building upon each of the theoretical results and insights. The experiments are well described.

Significance - Justification:
The contributions of this paper are novel and exciting. Not only does the paper's theoretical analysis shed light on and improve upon a popular algorithm, but the ideas and concepts introduced provide foundations that can facilitate future research and analysis of algorithms for anomaly detection on streams.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
See above comments.

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors propose changing the definition of the isolation tree of Liu et al. (2012).
The resulting method has many interesting properties and can be turned into a streaming method.

Clarity - Justification:
The paper is very clearly written and understandable.

Significance - Justification:
The authors propose changing the definition of the isolation tree of Liu et al. (2012). The resulting method has many interesting properties and can be turned into a streaming method. The method is significant since the Isolation Forest method has very good empirical performance but lacks theoretical analysis.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
The authors propose changing the definition of the isolation tree of Liu et al. (2012). Doing so, the method can be turned into a streaming method. The method is significant since the Isolation Forest method has very good empirical performance. The experiments are convincing.

It seems, however, that the new definition will be more sensitive to a change of scale between different covariates (due to the selection of the dimension with probability proportional to \ell_i / \sum_j \ell_j). This may be problematic in practical datasets.

Another interesting question is: what is the RRCF actually estimating? Fig. 3 in (Liu et al., 2012) showed the score for a Gaussian. Could a similar graph be included (e.g. in the Appendix)?

The Emmott et al. (2013) paper showed that random forest tends to be better than other methods. However, it would still be interesting to see a comparison between the proposed method and the other methods in that paper (e.g. OCSVM, SVDD on UCI benchmark datasets).

Minor comments:
One reference appears in the appendix (the paper slightly exceeds 9 pages).
lines 123-136: The explanation of the score can be improved.
line 181: "1the"
line 451: "for a set Z To capture"
Some references are inconsistent.

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
In this work, the authors develop and analyze principled methods for streaming non-parametric anomaly detection. The proposed method (RRCF) is a modification of Isolation Forests that does not suffer the same failure mode in the case of irrelevant data dimensions. Using the new formulation, the authors propose a principled way to define anomalies non-parametrically (collusive displacement, CoDisp). The basic idea is to consider the expected change in model complexity, as measured by description length, if a particular point were added or removed. Section 2 defines precisely the notion of outliers based on CoDisp, and Section 3 shows how RRCF(S) can be dynamically maintained in the presence of streaming insertions and deletions. In experiments, the proposed model is shown to correctly identify anomalies in a bike rental and a taxi dataset, and to achieve higher precision-recall scores than isolation forests.

Clarity - Justification:
The writing is clear and puts the paper in the context of previous work on anomaly detection. The technical details are fairly dense but thorough and logically structured.

Significance - Justification:
Handling streaming data and rigorously defining anomalies without a parametric model is an important and practical problem.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Will code and benchmark data be made available? Would the method be applicable or extensible to other streaming data modalities with higher dimensionality, such as audio or video?

=====
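
Illustration of the scale-sensitivity comment in Review #2. This is a minimal sketch, not the authors' code. It assumes that \ell_i is the side length of the bounding box of the current point set along dimension i, which the review does not define. The function name split_dimension_probs and the toy data are hypothetical. The sketch shows that changing the units of one covariate shifts almost all of the cut probability onto that covariate.

```python
def split_dimension_probs(points):
    """Return, for each dimension i, ell_i / sum_j ell_j.

    Assumption: ell_i is the bounding-box side length (max minus min)
    of the point set along dimension i.
    """
    dims = range(len(points[0]))
    ell = [max(p[i] for p in points) - min(p[i] for p in points) for i in dims]
    total = sum(ell)
    if total == 0:
        raise ValueError("all points coincide; no cut is possible")
    return [side / total for side in ell]


# Hypothetical toy data: both covariates span a range of 1.
points = [(0.0, 0.0), (1.0, 0.5), (0.5, 1.0)]
p_before = split_dimension_probs(points)

# Same data after expressing the second covariate in units 1000 times smaller.
rescaled = [(x, 1000.0 * y) for x, y in points]
p_after = split_dimension_probs(rescaled)

print("before rescaling:", p_before)
print("after rescaling: ", p_after)
```

Under this assumption, the two covariates are equally likely to be cut before rescaling, while after rescaling the second covariate is chosen with probability 1000/1001. Standardizing covariates before building the forest is one possible mitigation, though the review does not say whether the authors do so.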