This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
Single-cell RNA-Sequencing (scRNA-Seq) is a cutting-edge technology that enables the understanding of biological processes at an unprecedentedly high resolution. However, well-suited bioinformatics tools to analyze the data generated by this new technology are still lacking. Here we investigate the performance of the non-negative matrix factorization (NMF) method in analyzing a wide variety of scRNA-Seq data sets, ranging from mouse hematopoietic stem cells to human glioblastoma data. Compared with other unsupervised clustering methods, including K-means and hierarchical clustering, NMF has higher accuracy, even when the clustering results of K-means and hierarchical clustering are enhanced by t-SNE. Moreover, NMF successfully detects subpopulations, such as those in a single glioblastoma patient. Furthermore, in conjunction with the modularity detection method FEM, it reveals unique modules that are indicative of clinical subtypes. In summary, we propose that NMF is a desirable method for analyzing heterogeneous single-cell RNA-Seq data, and that the NMFEM pipeline is suitable for modularity detection in single-cell RNA-Seq data.
Version 2 removes erroneous mentions of Semi-NMF from the manuscript.
The consensus map of NMF and K-means methods run on the HSC vs. MPP1 dataset
Both the columns and the rows are samples. The brightness of each cell indicates how confidently the method assigns the corresponding pair of samples to the same group.
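To illustrate how such a consensus map can be built, the following minimal sketch computes, for every pair of samples, the fraction of repeated clustering runs in which the two samples received the same label. The function name `consensus_matrix` and the toy `runs` data are hypothetical and are not taken from the manuscript; the authors' actual consensus procedure may differ.

```python
def consensus_matrix(runs):
    """Return an n x n matrix (list of lists) whose (i, j) entry is the
    fraction of runs in which samples i and j share a cluster label.

    `runs` is a list of label lists, one per clustering run
    (hypothetical input format, assumed for this sketch).
    """
    n = len(runs[0])
    counts = [[0] * n for _ in range(n)]
    for labels in runs:
        if len(labels) != n:
            raise ValueError("every run must label the same samples")
        for i in range(n):
            for j in range(n):
                # Only co-membership matters, so permuted label names
                # across runs do not affect the result.
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(runs) for c in row] for row in counts]


# Toy data: three runs over four samples.
runs = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],  # same partition as the first run, labels permuted
    [0, 1, 1, 1],
]
consensus = consensus_matrix(runs)
```

Entries close to 1 correspond to the bright cells of the map (pairs that are almost always grouped together), and entries close to 0 to the dark cells.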
Comparison of clustering methods on the mouse dendritic cell scRNA-Seq data
(A) Two-dimensional t-SNE scatter plots. Colors indicate the most favorable labeling that can be assigned to the clustering result generated by each method. Correctly and incorrectly labeled samples are marked by a dot (•) and a cross (x), respectively. (B) Rand measures of the compared methods, before and after t-SNE. The Rand measure ranges from 0 to 1, where a higher value indicates greater clustering accuracy.
(A) Kernel density estimation (KDE) plot showing the frequency of log expression values of the “important genes” that separate E14.5 from E16.5, as detected by each of the compared methods. (B) KDE plot of how frequently genes appear across the 71 jackknife runs. For a given x-value (frequency), a higher y-value (density) means that a larger percentage of genes appear at around that frequency across the 71 runs. The blue block is the top 500 genes selected by NMF and the red block is all the genes in the filtered data used by NMF.
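As a reference for the Rand measure, here is a minimal sketch that assumes the plain (unadjusted) Rand index, which matches the stated 0 to 1 range: the fraction of sample pairs on which the true and predicted partitions agree. The function name `rand_index` and the toy labels are illustrative assumptions, not the authors' code.

```python
from itertools import combinations


def rand_index(truth, predicted):
    """Plain Rand index: the fraction of sample pairs that are either
    grouped together in both labelings or separated in both.
    Requires at least two samples."""
    pairs = list(combinations(range(len(truth)), 2))
    agree = sum(
        (truth[i] == truth[j]) == (predicted[i] == predicted[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Toy labels (hypothetical): six samples, two true groups.
truth = [0, 0, 0, 1, 1, 1]
predicted = [1, 1, 0, 0, 0, 0]
score = rand_index(truth, predicted)       # 10 of 15 pairs agree
perfect = rand_index(truth, [5, 5, 5, 2, 2, 2])  # label names are irrelevant
```

Because the index compares pairs rather than label names, a clustering that recovers the true groups under different names still scores 1.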
Using NMF to identify subpopulations in a single glioblastoma tumor from Patient MGH31
(A) The consensus heat map generated by NMF. The two subpopulation clusters appear as the two evident red squares, labeled 1 and 2. The brightness indicates the confidence level of the two subpopulations. (B) The PCA plot of scRNA-Seq samples from patient MGH31; the discovered subpopulations are shown in red and blue. (C) The results of KEGG/BioCarta pathway enrichment analysis. The line of significance is shown; values to its right correspond to an FDR of less than 0.05. (D) The protein interaction diagram of the KEGG pathway “Pathogenic E. coli infection”. The proteins encoded by the genes detected by NMF are highlighted in yellow, with the gene names marked below.