Detecting heterogeneity in single-cell RNA-Seq data by non-negative matrix factorization

Xun Zhu; Travers Ching; Xinghua Pan; Sherman Weissman; Lana Garmire

doi:10.7287/peerj.preprints.1839v2

Xun Zhu^1,2, Travers Ching^1,2, Xinghua Pan^3, Sherman Weissman^3, Lana Garmire^1

1 Epidemiology Program, University of Hawaii Cancer Center, Honolulu, HI, United States

2 Molecular Biosciences and Bioengineering Graduate Program, University of Hawaii at Manoa, Honolulu, HI, United States

3 Department of Genetics, Yale University, New Haven, Connecticut, United States

Published: 2016-03-09
Accepted: 2016-03-09

Subject Areas: Bioinformatics, Genomics
Keywords: single-cell, rna-seq, heterogeneity, non-negative matrix factorization, modularity, clustering, subpopulation, single cell sequencing, non negative matrix factorization

Copyright: © 2016 Zhu et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.

Cite this article: Zhu X, Ching T, Pan X, Weissman S, Garmire L. 2016. Detecting heterogeneity in single-cell RNA-Seq data by non-negative matrix factorization. PeerJ Preprints 4:e1839v2 https://doi.org/10.7287/peerj.preprints.1839v2

Abstract

Single-cell RNA-Sequencing (scRNA-Seq) is a cutting-edge technology that enables the understanding of biological processes at an unprecedentedly high resolution. However, well-suited bioinformatics tools to analyze the data generated by this new technology are still lacking. Here we investigate the performance of the non-negative matrix factorization (NMF) method on a wide variety of scRNA-Seq data sets, ranging from mouse hematopoietic stem cells to human glioblastoma data. Compared with other unsupervised clustering methods, including K-means and hierarchical clustering, NMF has higher accuracy even when the clustering results of K-means and hierarchical clustering are enhanced by t-SNE. Moreover, NMF successfully detects subpopulations, such as those within a single glioblastoma patient. Furthermore, in conjunction with the modularity detection method FEM, it reveals unique modules that are indicative of clinical subtypes. In summary, we propose that NMF is a desirable method for analyzing heterogeneous single-cell RNA-Seq data, and that the NMFEM pipeline is suitable for modularity detection in single-cell RNA-Seq data.
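
As an illustration of the kind of NMF-based clustering the abstract refers to, the following is a minimal sketch rather than the manuscript's implementation. It assumes a genes-by-cells expression matrix, the multiplicative-update rule for the squared-error objective, and assignment of each cell to the factor with the largest coefficient. All function names and the toy matrix are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(row) for row in zip(*a)]


def nmf(v, k, n_iter=500, seed=0, eps=1e-9):
    """Factor a non-negative matrix v (genes x cells) as w @ h.

    Assumption: multiplicative updates for the squared-error objective;
    the manuscript may use a different NMF variant.
    """
    rng = random.Random(seed)
    n_genes, n_cells = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n_genes)]
    h = [[rng.random() + eps for _ in range(n_cells)] for _ in range(k)]
    for _ in range(n_iter):
        # Update h: h <- h * (w^T v) / (w^T w h)
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(n_cells)]
             for a in range(k)]
        # Update w: w <- w * (v h^T) / (w h h^T)
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n_genes)]
    return w, h


def assign_clusters(h):
    """Label each cell (column of h) by its largest factor coefficient."""
    return [max(range(len(h)), key=lambda a: h[a][j]) for j in range(len(h[0]))]


def reconstruction_error(v, w, h):
    """Sum of squared differences between v and w @ h."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw))


# Toy matrix: genes 0-1 are expressed in cells 0-2, genes 2-3 in cells 3-5.
toy = [
    [5, 5, 5, 0, 0, 0],
    [4, 4, 4, 0, 0, 0],
    [0, 0, 0, 3, 3, 3],
    [0, 0, 0, 6, 6, 6],
]
w, h = nmf(toy, k=2)
labels = assign_clusters(h)
```

Which of the two labels each block of cells receives depends on the random initialization; only the grouping of cells is meaningful.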

Author Comment

Version 2 removes erroneous mentions of Semi-NMF from the manuscript.

Supplemental Information

The consensus map of NMF and K-means methods run on the HSC vs. MPP1 dataset

The columns and rows are samples. Brightness indicates the method's confidence in assigning a pair of samples to the same group.

DOI: 10.7287/peerj.preprints.1839v2/supp-1
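
One common way to build such a consensus map is to repeat the clustering several times and record, for each pair of samples, the fraction of runs in which they fall in the same cluster. The sketch below assumes that definition; the manuscript may compute its map differently, and the function name and example runs are hypothetical.

```python
def consensus_matrix(runs):
    """Return an n x n matrix of co-clustering frequencies.

    `runs` is a list of label lists, one per clustering run, all of the
    same length n. Entry [i][j] is the fraction of runs in which samples
    i and j received the same label.
    """
    if not runs:
        raise ValueError("at least one clustering run is required")
    n = len(runs[0])
    if any(len(labels) != n for labels in runs):
        raise ValueError("all runs must label the same number of samples")
    counts = [[0] * n for _ in range(n)]
    for labels in runs:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(runs) for c in row] for row in counts]


# Three hypothetical runs over three samples.
example_runs = [[0, 0, 1], [1, 1, 0], [0, 1, 1]]
consensus = consensus_matrix(example_runs)
```

Because only co-membership is counted, the result does not depend on how cluster labels are numbered in each run.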

Comparison of clustering methods on the mouse dendritic cell scRNA-Seq data

(A) t-SNE two-dimensional scatter-plots. Colors indicate the most favorable labeling that can be assigned to the clustering result generated by each method. The correctly and incorrectly labeled samples are marked by dot (•) and cross (x), respectively. (B) Rand measures of the methods in comparison, before and after t-SNE. Rand measure ranges from 0 to 1, where a higher value indicates a greater clustering accuracy.

DOI: 10.7287/peerj.preprints.1839v2/supp-2
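
For readers unfamiliar with the Rand measure, the following minimal sketch assumes the unadjusted Rand index, which matches the stated range of 0 to 1: the fraction of sample pairs on which the reference labeling and the clustering agree (both place the pair together, or both place it apart). The function name is hypothetical.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Unadjusted Rand index between two labelings of the same samples."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    pairs = list(combinations(range(len(labels_true)), 2))
    if not pairs:
        raise ValueError("at least two samples are required")
    agree = 0
    for i, j in pairs:
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true == same_pred:
            agree += 1
    return agree / len(pairs)


# Identical partitions with renamed labels agree on every pair.
perfect = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# A clustering that splits every true group agrees on only 2 of 6 pairs.
poor = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The measure compares partitions rather than label names, so a clustering does not need to be matched to the reference labels before it is scored.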

PCA plot of the mouse epithelial cell data set

The groups that are most difficult to separate (E14.5 vs. E16.5) are circled.

DOI: 10.7287/peerj.preprints.1839v2/supp-3

Characteristics of important genes calling

(A) The kernel density estimation (KDE) plot showing the frequency of log expression values of “important genes” that separate E14.5 vs. E16.5, as detected by the methods under comparison. (B) KDE plot of how frequently genes appear across the 71 jackknife runs. For a given x-value (frequency), a higher y-value (density) means that a larger proportion of genes appear at around that frequency among the 71 runs. The blue block represents the top 500 genes selected by NMF and the red block represents all genes in the filtered data used by NMF.

DOI: 10.7287/peerj.preprints.1839v2/supp-4

The heatmap of the characteristic genes (E14.5 vs. E16.5) found in common pair-wise by the various methods

The dendrogram at the bottom shows the hierarchical clustering results using the distance measured by the inverse of the number of overlapping genes.

DOI: 10.7287/peerj.preprints.1839v2/supp-5

Using NMF to identify subpopulations in a single glioblastoma tumor from Patient MGH31

(A) The consensus heat map generated by NMF. The two subpopulation clusters appear as the two red squares, labeled 1 and 2. Brightness indicates the confidence with which samples are assigned to the same subpopulation. (B) The PCA plot of scRNA-Seq samples from patient MGH31; the discovered subpopulations are shown in red and blue. (C) The results of KEGG/BioCarta pathway enrichment analysis. The line of significance is shown; pathways extending to its right have an FDR below 0.05. (D) The protein interaction diagram of the KEGG pathway “Pathogenic E. coli infection”. The proteins encoded by the genes detected by NMF are highlighted in yellow, with the gene names marked below.

The FPKM table for HSC vs. MPP1 scRNA-Seq dataset
