Paper ID: 487
Title: Dirichlet Process Mixture Model for Correcting Technical Variation in Single-Cell Gene Expression Data

Review #1 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors present an extension of the multivariate normal DPMM to the setting of single cell RNA-seq (scRNA) data. This is achieved by incorporating additional per-cell scaling parameters. Some theory from DPMs is adapted to this setting to show the requirements for identifiability and consistency, and a straightforward (parallelized) Gibbs sampling algorithm is developed. On simulated data from the model, the method is shown to recover true clusters better than more straightforward approaches. On real data, it is argued that the clusters recovered agree better with known cell types. Finally, a data imputation task is used to suggest that the proposed method fits the data better than simpler approaches.

Clarity - Justification:
The paper is generally well written, with only a few slightly confusing sections or awkwardly phrased sentences.

Significance - Justification:
I agree that scRNA-seq is an interesting new application area, but I did not find the results sufficiently convincing. In particular, a comparison to BASiCS (Vallejos et al., PLOS Comp Bio, 2015), another MCMC-based approach to scRNA-seq analysis, is missing, which makes assessing the proposed method challenging. BASiCS also learns per-cell scaling parameters, and while it does not jointly perform clustering, it remains to be shown that BASiCS normalization followed by a standard clustering method would not do just as well as the proposed method. In addition, BASiCS is able to make use of spike-in controls (ERCCs), which are becoming a standard part of scRNA methods.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
I like the basic idea of this paper: to jointly model the noise and technical aspects of scRNA data while also finding a clustering. My biggest concern is the lack of comparison to existing normalization methods, apart from very simple ones implemented by the authors (which might be considered "straw man" comparisons). The lack of comparison to BASiCS is particularly troubling. In addition, I would like more evidence that modeling log(counts+1) as Gaussian is reasonable. Given the dropout issue mentioned repeatedly in the paper, this seems questionable, since there will be a delta spike at 0. Since sampling is used anyway, why not use a more appropriate likelihood such as the negative binomial?

===== Review #2 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
This paper proposes the use of an extended hierarchical Dirichlet process mixture model (single cell HDPMM) for clustering single cell gene expression data while simultaneously modeling the biological signal and the technical variation associated with single cell measurements. The authors introduce additional parameters at the cellular level which allow them to account for heteroscedasticity (levels of technical variation which themselves vary with the latent cell type). They can use these parameters to correct for library size (total counts), and they can use the resulting model to infer missing data (drop-out gene counts) as well. They accomplish inference via Gibbs sampling based on the Chinese Restaurant Process. Finally, they show significant improvement over current state-of-the-art methodology in clustering simulated single cell gene expression data, and they also show that this is robust to the situation in which the data is generated by a mismatched model (non-Gaussian).
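For readers unfamiliar with the Chinese Restaurant Process mentioned in the summary, the prior it induces over cluster assignments can be sketched as follows. This is a minimal illustration of the generic CRP only, not the authors' sampler; the function name `sample_crp` and the concentration parameter `alpha` are assumptions introduced for this example.

```python
import random


def sample_crp(n_cells, alpha, seed=0):
    """Draw cluster assignments for n_cells from a CRP with concentration alpha.

    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = random.Random(seed)
    assignments = []  # cluster index of each cell, in arrival order
    counts = []       # number of cells currently in each cluster
    for i in range(n_cells):
        # Cell i joins existing cluster k with probability counts[k] / (i + alpha)
        # and opens a new cluster with probability alpha / (i + alpha).
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # new cluster
        else:
            counts[k] += 1     # existing cluster
        assignments.append(k)
    return assignments, counts
```

In a CRP-based Gibbs sampler, these prior proportions are combined with each cluster's likelihood for the cell being reassigned; a larger `alpha` tends to produce more clusters.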
They also provide a qualitative analysis of the resulting co-expression models in real single-cell gene expression data, and they show quantitatively that their method outperforms other methods in imputing drop-out values and in inferring clustering assignments based on the imputed data. Significantly, they show intuitive correlations between the parameters which extend the HDPMM and the sources of technical variation in single cell measurements (noise associated with drop-out values and library size).

Clarity - Justification:
Overall, the presentation of the paper is well organized and the results would likely be reproducible. However, the quality of the writing is often lacking and the manuscript would benefit from careful editing: there are many places where the language is confusing, unclear, or awkward, or where the grammar is incorrect (the most common errors involve just one or two words missing from a sentence, or incorrect comma placement). As a result, there were several instances in which I had to spend some time re-reading to fully understand the content. A few examples of places where words are missing: “There is considerable variation in the number of hours spent among students…” “…not factoring them will lead to inaccurate interpretations.” “This contrasts traditional bulk…”

Significance - Justification:
This is an important direction for inquiry, particularly because previous work either fails to account for cell-type dependent variation or uses less elegant methodology (such as normalization) that cannot successfully remove technical noise while preserving informative signal across cells. The methodology proposed in this submission is the natural next step in addressing these issues, and the authors clearly show improvement in modeling cell-type variation. Modeling cell type, particularly in an exploratory context such as clustering, could give us important insight into cellular function both for microbiomes (such as bacteria) and for healthcare situations in which studying or identifying single cells can be imperative (e.g. cancers). As such, the contribution proposed by this submission is significant.

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
As detailed above, this is an interesting and significant contribution that I believe merits acceptance. Although the clarity is somewhat lacking, the overall organization is good and the contribution of the paper is communicated. My biggest concern is with the quality of the writing, but I believe that this could be easily remedied. Additionally, there are a few other issues:
1. Additional motivation for single-cell analysis is desirable; although this can perhaps be inferred, the authors do not explain why we should care about single-cell variation or discovering new cell types.
2. In order to test the performance of their method, the authors use simulated data. However, the simulated data is produced from the generative model that they propose for their work. Therefore, comparing analysis of this data using their model (or even other models based on HDPMMs) to non-HDPMM methods (such as spectral clustering) seems less informative; it is not surprising that models which assume a certain generative model are better able to capture data simulated by the same generative model (even when the base distribution is switched from Gaussian to some other distribution). In fact, the non-HDPMM models do seem to perform the worst. A better simulation may be in order, or perhaps the authors should consider extending the model-mismatch section of their submission.
3. It may be preferable for the authors to show quantitative results describing the performance of sc-HDPMM vs. PhenoGraph in clustering the mouse cortex cells (section 6.2); it is unclear why these are discussed only in qualitative terms, which do not explicitly show improvement of the method.
4. Minor: Kullback-Leibler divergence is used, but not explained (and only referred to as KL).

===== Review #3 =====

Summary of the paper (Summarize the main claims/contributions of the paper.):
The authors propose a nonparametric Bayesian method called sc-HDPMM for modeling single-cell RNA-seq data. Single cell RNA-seq is a biological experimental assay that measures the level of expression of each gene in the genome, for each cell in an assayed population of cells. Single cell RNA-seq is a relatively new technology, so developing methods to understand and analyze its data is an active area of research. The authors propose a Dirichlet process mixture model for analyzing these data sets. This model clusters the cells using a Dirichlet process mixture model (which allows the algorithm to learn the number of clusters), and models the gene expression of each cluster as a multivariate Gaussian distribution. The model learns separate variance parameters for each cluster.
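In generic textbook notation, a Dirichlet process mixture with Gaussian components of the kind summarized here can be written as follows. The symbols ($\alpha$, $H$, $G$, $\mu_i$, $\Sigma_i$, $x_i$) are assumptions chosen for this sketch and need not match the manuscript's notation, and the per-cell scaling parameters of sc-HDPMM are deliberately omitted.

```latex
\begin{align*}
G &\sim \mathrm{DP}(\alpha, H), \\
(\mu_i, \Sigma_i) \mid G &\sim G, \qquad i = 1, \dots, n, \\
x_i \mid \mu_i, \Sigma_i &\sim \mathcal{N}(\mu_i, \Sigma_i).
\end{align*}
```

Because draws from $G$ are discrete with probability one, cells share parameter values and thereby form clusters, so the number of occupied clusters is inferred rather than fixed in advance.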
The authors provide a number of empirical and theoretical properties of this method for this data set, including:
- A verification of a Gaussian assumption using the Lilliefors test
- Proofs of identifiability (save under permutations) and weak posterior consistency that follow proofs given for similar nonparametric Bayesian methods.
- A demonstration that data generated from the authors' proposed model is much better fit by this model than by similar methods.
- An application to real data and comparison to a related method (PhenoGraph), albeit without a quantitative basis for comparison.

Clarity - Justification:
The manuscript is very well written and easy to follow, save for a few minor typos, as follows:
- The manuscript frequently and incorrectly uses the plural of mass nouns, such as "variations" and "expressions". These should be "variation" and "expression" respectively.
- Sections 2.2 and 3: There are many undefined acronyms, including "DP" and "DPMM".
- 344: "bayes" -> "Bayes"

Significance - Justification:
The analysis of single cell RNA-seq data is an important area of active research, as is the development of nonparametric Bayesian approaches. However, the proposed method is a relatively incremental advance over existing nonparametric Bayesian methods. The manuscript does not describe any significant challenges posed by applying existing Dirichlet mixture model strategies to this problem, and applies the same types of inference algorithms and proof strategies. Therefore this method is of interest primarily to computational biologists, and is less interesting to a wider machine learning audience. The impact of this work could be greatly improved in either of two ways:
(1) Are there ways that single cell RNA-seq data could be better modeled that existing nonparametric Bayesian methods do not allow? Are there novel methods that can be developed to enable such improved models?
(2) Can the authors derive any novel insights into cell biology and gene expression through the results of sc-HDPMM?

Detailed comments. (Explain the basis for your ratings while providing constructive feedback.):
Overall, the manuscript is very well written and technically sound, but may be of somewhat incremental significance.

=====