SCJNMF: singlecell clustering integrating multiple quantification methods based on joint nonnegative matrix factorization
 Published
 Accepted
 Received
 Academic Editor
 Kenta Nakai
 Subject Areas
 Bioinformatics, Computational Biology, Genomics, Computational Science
 Keywords
 Singlecell, RNAseq, Nonnegative matrix factorization, Clustering
 Copyright
 © 2021 Shiga et al.
 Licence
 This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
 Cite this article
 2021. SCJNMF: singlecell clustering integrating multiple quantification methods based on joint nonnegative matrix factorization. PeerJ 9:e12087 https://doi.org/10.7717/peerj.12087
Abstract
Singlecell RNAsequencing is a rapidly evolving technology that enables us to understand biological processes at unprecedented resolution. Singlecell expression analysis requires a complex data processing pipeline, and the pipeline is divided into two main parts: The quantification part, which converts the sequence information into genecell matrix data; the analysis part, which analyzes the matrix data using statistics and/or machine learning techniques. In the analysis part, unsupervised cell clustering plays an important role in identifying cell types and discovering cell diversity and subpopulations. Identified cell clusters are also used for subsequent analysis, such as finding differentially expressed genes and inferring cell trajectories. However, singlecell clustering using gene expression profiles shows different results depending on the quantification methods. Clustering results are greatly affected by the quantification method used in the upstream process. In other words, even if the original RNAsequence data is the same, gene expression profiles processed by different quantification methods will produce different clusters. In this article, we propose a robust and highly accurate clustering method based on joint nonnegative matrix factorization (jointNMF) by utilizing the information from multiple gene expression profiles quantified using different methods from the same RNAsequence data. Our jointNMF can extract common factors among multiple gene expression profiles by applying each NMF under the constraint that one of the factorized matrices is shared among multiple NMFs. The jointNMF determines more robust and accurate cell clustering results by leveraging multiple quantification methods compared to conventional clustering methods, which use only a single gene expression profile. Additionally, we showed the usefulness of discovering marker genes with the extracted features using our method.
Introduction
Advances in technology have made it possible to isolate individual cells from a population of cells and to sequence their transcriptomes at the singlecell level, known as singlecell RNAsequencing (scRNAseq). This technology has reached a surprising level of resolution that reveals the regulation of gene expression within cells. scRNAseq measures gene expression on a cellbycell basis and allows the analysis of the functions and properties of cells using this information. Many experimental protocols and computational analyses exist for scRNAseq, and they have different goals such as differential expression analysis, cell clustering, cell classification, and trajectory reconstruction. Therefore, singlecell analysis forms a pipeline (a series of procedures) that mainly consists of two parts, quantification of gene expression and downstream analysis depends on the goals. In the quantification part of a pipeline, gene expression levels measured for each cell (raw sequence data) are converted to a matrix called the “gene expression profile”. The quantification part also forms a pipeline including read quality control, adaptor trimming, demultiplexing, deduplicating using barcodes, mapping to genome, counting transcripts.
One of the objectives of singlecell analysis is to identify cell types by applying unsupervised clustering and extract a group of characteristic genes for a specific cell type as marker genes. Clustering is a useful method for the classification and identification of unknown cell groups and discovering the diversity and subpopulation of known cell types, and it is a fundamental step in scRNAseq data analysis. It is the key to understanding cell function and constitutes the basis of other advanced analyses. Quantification is a critical factor for the subsequent clustering of analysis results. Different quantification methods result in different gene expression values even when the same RNAseq library is processed. The greatest difference in the quantification methods is using an alignmentbased method or an alignmentfree method. As examples of alignmentbased methods, short reads of the RNAseq library are aligned to a reference genome using alignment tools, such as Bowtie2 (Langmead & Salzberg, 2012) and STAR (Dobin et al., 2013). Then, the gene expression values are obtained from the results using gene expression estimation tools such as RSEM (Li et al., 2010) and Cufflinks (Trapnell et al., 2010). In contrast, alignmentfree tools such as Salmon(Patro et al., 2017) and kallisto(Bray et al., 2016) estimates mRNA abundances with kmer counting approach (pseudoalignment for transcriptome indices). Depends on the methods and the property of the RNAseq library, several units of gene expression values are used such as RPKM (Reads per Kilobase Million), FPKM (Fragments per Kilobase Million), TPM (Transcripts per Million), UMI (Unique Molecular Identifiers) counts, and raw read counts. In addition, there are specialized quantification pipelines for the platform, such as Cell Ranger (Zheng et al., 2017) for 10X Genomics, and mappa for the ICELL8 system. To date, many methods for quantifying gene expression from RNAseq library have been proposed; however, a consensus has not yet been reached on the best quantification method for all data (CostaSilva, Domingues & Lopes, 2017; Vieth et al., 2019; Wu et al., 2018). Moreover, since each quantification method has different measurable genes, conventional analysis methods for cell clustering using gene expression quantified via only one method are strongly biased.
After quantification, feature selection or dimension reduction is required to analyze the gene expression profile because it has a large number of rows and columns (cells and genes). Additionally, singlecell analysis involves some differences among cells (e.g., different gene expression values derived from different phases in the cell cycle and errors in measurement, such as missing values), even if these cells belong to the same cell type. For these reasons, we need a robust and datadriven clustering method for features that reflect variation in gene expression within individual cells. To date, many unsupervised feature selection, dimensionality reduction, and clustering methods have been developed (Sun et al., 2019; Kiselev, Andrews & Hemberg, 2019; Freytag et al., 2018). In particular, SC3 (Kiselev et al., 2017) and Seurat (Satija et al., 2015) are useful datadriven analysis tools for single cells. SC3 is a clustering method that uses an ensemble of multiple analysis results using the algorithm based on kmeans. Seurat is a method to analyze single cells, and Louvain clustering is used in cell clustering. Matrix factorization, as a method of unsupervised learning, is another efficient method for cell clustering and is excellent in data dimension reduction or the extraction of latent factors. In particular, nonnegative matrix factorization(NMF) (Lee & Seung, 1999) is a suitable method for dimension reduction to extract the features of gene expression profiles because NMF interprets the data as a superposition of the gene functions and cell characteristics. NMF factorizes a matrix into multiple matrices (basis and coefficients) under the constraint that all elements are nonnegative. The product of these matrices includes an approximate matrix of the input matrix. NMF is applied to various real data because of its nonnegative feature and has been adapted to microarray data for clustering or feature extraction (Brunet et al., 2004; Zheng et al., 2011; NikZainal et al., 2012; Zhang et al., 2012) and scRNAseq data (Zhu et al., 2017; Wu et al., 2020). Especially, Shao & Höfer (2017) used LSNMF, using the projected gradient method to optimize the objective function (Lin, 2007), for singlecell clustering. By performing NMF, we can analyze more details of genes and cells (e.g., finding marker genes and performing unsupervised cell clustering) with the feature matrices extracted.
However, these clustering methods do not consider the differences in quantification methods that are upstream of this series of procedures. We usually observe that different quantification pipelines produce similar but also different gene expression profiles even when we apply them to the same RNAseq library. Consequently, different clusters are obtained and misunderstanding might be produced. Our objective is to develop a method that integrates the gene expression profiles from different quantification methods to extract reliable feature matrices for clustering. JointNMF, which performs multiple NMF with a shared matrix, is one of the most suitable methods for integrating such different but potentially similar data. JointNMF has also been studied for genomic applications, Wang, Zheng & Zhao (2015) simultaneously decompose multiple transcriptomics data matrices. Zhang et al. (2012), Yang & Michailidis (2016), Duren et al. (2018), Jin, Zhang & Nie (2020) integrated heterogeneous omics multimodal data (e.g., DNA methylation, miRNA expression, and gene expression) to detect modules, and Fujita et al. (2018) also integrated multiomics data to discover biomarkers with a combination of JointNMF and pathway analysis. These methods decompose the data into a shared matrix of “genes × modules” and some dedicated matrices of “modules × samples” for each omics data. Genes are regarded as common variables across data matrices and samples are different. In contrast, we decompose the data into some matrices of “genes × modules” and a shared matrix of “modules × cells” because the cells included in the gene expression profiles are perfectly the same in our problem setting.
In this article, we propose SCJNMF, a novel unsupervised clustering method using NMF to eliminate the differences in quantification methods and extract the common factors over multiple gene expression profiles. The features of the data can be decomposed into genederived factors that contain bias dependent on each quantification method and common cellderived factors, and cell clustering can be performed based on these common factors to obtain more essential biological information. To our knowledge, this study is the first to incorporate multiple quantification methods into the clustering analysis of scRNAseq data.
Materials & Methods
The outline of SCJNMF
We proposed SCJNMF, a method that extracts latent factors from different gene expression profiles at the same time using jointNMF, which can express matrix data as the product of lowerdimensional matrices, one of which is shared. SCJNMF extracted the latent factors in different gene expression profiles using a similar approach to NMF and used them for cell clustering and gene analysis.
The outline of our method is showed in Fig. 1. We created two different gene expression profiles using different quantification methods from the same RNAseq library. Then, we extracted the common factor using jointNMF and extended it to perform multiple NMFs in parallel with two different basis factors W_{1}, W_{2} (derived from the different methods) and shared factors H (derived from the original RNAseq library). Finally, we performed cell clustering using these extracted factors and any appropriate clustering methods, such as hierarchical clustering. Although the conventional NMF clustering methods that perform clustering directly with the factorized matrix strongly depend on the rank, our method performs robust clustering by using hierarchical clustering.
Quantification and normalization
As the first step of SCJNMF, we quantified gene expressions with different quantification pipelines, and obtained a set of gene expression profiles. Parameters and references could be set as recommended by each quantification pipeline.
For preprocessing before performing joint matrix factorization, we applied a gene filter to the input data, similar to the SC3 method (Kiselev et al., 2017), and then log_{2}(x + 1) transformed the data. Finally, we normalized the data so that the sum of each gene L^{1} norm was 1 because it should be compared the relative expression values between cells for each gene to perform the cell clustering. This L_{1} normalization is possible to consider smaller expression values than using the L_{2} norm and to keep the values nonnegative.
JointNMF
We considered the given data as a nonnegative matrix $\mathit{D}\in {\mathbb{R}}_{+}^{N\times M}$. Nonnegative matrix factorization found the basis matrix $\mathit{W}\in {\mathbb{R}}_{+}^{N\times k}$ and coefficient matrix $\mathit{H}\in {\mathbb{R}}_{+}^{k\times M}$, where all the elements were nonnegative, such that these matrix products approximated the input data matrix. (1)$\mathit{D}\approx \mathit{W}\mathit{H}$ NMF minimized the distance between matrix D and matrix WH. Here, we considered a Euclidean norm distance. Thus, the objective function that NMF minimizes was as follows: (2)$\underset{\mathit{W},\mathit{H}\ge 0}{min}L:={\parallel \mathit{D}\mathit{W}\mathit{H}\parallel}_{\mathcal{F}}^{2}$ Here, we considered a simultaneous matrix factorization of two matrices using NMF. Given the two input data as nonnegative matrices ${\mathit{D}}_{1}\in {\mathbb{R}}_{+}^{{N}_{1}\times M}$ and ${\mathit{D}}_{2}\in {\mathbb{R}}_{+}^{{N}_{2}\times M}$, jointNMF found the basis matrices ${\mathit{W}}_{1}\in {\mathbb{R}}_{+}^{{N}_{1}\times k},{\mathit{W}}_{2}\in {\mathbb{R}}_{+}^{{N}_{2}\times k}$ corresponding to each approximated input matrix and the common coefficient matrix $\mathit{H}\in {\mathbb{R}}_{+}^{k\times M}$, minimizing the distances between the given matrices and the approximated matrices. In our proposed method, we added the L1 norm constraint to this objective function for H so that the factorized shared matrix was sparse. Thus, SCJNMF found the matrix W_{1}, W_{2}, H that minimized the following objective function: (3)$\underset{{\mathit{W}}_{1},{\mathit{W}}_{2},\mathit{H},{\lambda}_{n}\ge 0}{min}L:={\parallel {\mathit{D}}_{1}{\mathit{W}}_{1}\mathit{H}\parallel}_{\mathcal{F}}^{2}+{\lambda}_{1}{\parallel {\mathit{D}}_{2}{\mathit{W}}_{2}\mathit{H}\parallel}_{\mathcal{F}}^{2}+{\lambda}_{2}\sum _{k}\parallel {\mathit{H}}^{k,\ast}{\parallel}_{1}$ where, λ_{1} is a balance parameter for the losses of reconstruction matrices and λ_{2} is a parameter for row vector sparsity regularization of shared factorized matrix.
We applied a multiplicative update algorithm to optimize the objective function same as conventional NMF. By applying Jensen’s inequalities to the first and second terms of the objective function, the function to be minimized could be rewritten as follows: (4)$\text{}\underset{{\mathit{W}}_{1},{\mathit{W}}_{2},\mathit{H},{\lambda}_{n}\ge 0}{min}L:=\sum _{i,j}\left({\mathit{D}}_{1}^{i,j}{}^{2}2{\mathit{D}}_{1}^{i,j}\sum _{k}{\mathit{W}}_{1}^{i,k}{\mathit{H}}^{k,j}\right.\left.+\sum _{k}\frac{{\left({\mathit{W}}_{1}^{i,k}\right)}^{2}{\left({\mathit{H}}^{k,j}\right)}^{2}}{{c}_{1}^{i,j,k}}\right)+{\lambda}_{1}\sum _{i,j}\left({\mathit{D}}_{2}^{i,j}{}^{2}2{\mathit{D}}_{2}^{i,j}\sum _{k}{\mathit{W}}_{2}^{i,k}{\mathit{H}}^{k,j}\right.\left.+\sum _{k}\frac{{\left({\mathit{W}}_{2}^{i,k}\right)}^{2}{\left({\mathit{H}}^{k,j}\right)}^{2}}{{c}_{2}^{i,j,k}}\right)+{\lambda}_{2}\parallel \mathit{H}{\parallel}_{1}$ where, (5)$c}_{1}^{i,j,k}=\frac{{\mathit{W}}_{1}^{i,k}{\mathit{H}}^{k,j}}{\sum _{{k}^{\prime}}{\mathit{W}}_{1}^{i,{k}^{\prime}}{\mathit{H}}^{{k}^{\prime},j}$ (6)$c}_{2}^{i,j,k}=\frac{{\mathit{W}}_{2}^{i,k}{\mathit{H}}^{k,j}}{\sum _{{k}^{\prime}}{\mathit{W}}_{2}^{i,{k}^{\prime}}{\mathit{H}}^{{k}^{\prime},j}$ We found each element of W_{1}, W_{2}, and H that minimized the objective function by performing partial differentiation. (7)$\frac{\partial L}{\partial {W}_{1}^{i,k}}=\sum _{j}\left(2{D}_{1}^{i,j}{H}^{k,j}+\frac{2{W}_{1}^{i,k}{\left({H}^{k,j}\right)}^{2}}{{c}_{1}^{i,j,k}}\right)$ (8)$\frac{\partial L}{\partial {W}_{2}^{i,k}}={\lambda}_{1}\sum _{j}\left(2{D}_{2}^{i,j}{H}^{k,j}+\frac{2{W}_{2}^{i,k}{\left({H}^{k,j}\right)}^{2}}{{c}_{2}^{i,j,k}}\right)$ (9)$\frac{\partial L}{\partial {H}^{k,j}}=\sum _{i}\left(2{D}_{1}^{i,j}{W}_{1}^{i,k}+\frac{2{\left({W}_{1}^{i,k}\right)}^{2}{H}^{k,j}}{{c}_{1}^{i,j,k}}\right)+{\lambda}_{1}\sum _{{i}^{\prime}}\left(2{D}_{2}^{{i}^{\prime},j}{W}_{2}^{{i}^{\prime},k}+\frac{2{\left({W}_{2}^{{i}^{\prime},k}\right)}^{2}{H}^{k,j}}{{c}_{2}^{{i}^{\prime},j,k}}\right)+{\lambda}_{2}$
The objective function are minimized when these are 0. Thus, the variable updates became: (10)$\mathit{W}}_{1}={\mathit{W}}_{1}\frac{{\mathit{D}}_{1}{\mathit{H}}^{\top}}{{\left[\mathit{H}{\left[{\mathit{W}}_{1}\mathit{H}\right]}^{\top}\right]}^{\top}$ (11)$\mathit{W}}_{2}={\mathit{W}}_{2}\frac{{\lambda}_{1}{\mathit{D}}_{2}{\mathit{H}}^{\top}}{{\lambda}_{1}{\left[\mathit{H}{\left[{\mathit{W}}_{2}\mathit{H}\right]}^{\top}\right]}^{\top}$ (12)$\mathit{H}=\mathit{H}\frac{{\left[{\mathit{D}}_{1}^{\top}{\mathit{W}}_{1}\right]}^{\top}+{\lambda}_{1}{\left[{\mathit{D}}_{2}^{\top}{\mathit{W}}_{2}\right]}^{\top}{\lambda}_{2}/2}{{\mathit{W}}_{1}^{\top}{\mathit{W}}_{1}\mathit{H}+{\lambda}_{1}{\mathit{W}}_{2}^{\top}{\mathit{W}}_{2}\mathit{H}}$
Applications
Cell clustering
The factorized matrices were used to perform highly accurate clustering. In our proposed method, we used hierarchical clustering (Ward’s method (Ward, 1963), implemented in SciPy (Virtanen et al., 2020)). In this study, the adjusted Rand index (ARI, implemented in scikitlearn (Pedregosa et al., 2011)) was used to evaluate clustering performance.
Given a set of n elements (i.e., cells) S = {o_{1}, …, o_{n}}, and supposing that U = {u_{1}, …, u_{R}} and V = {v_{1}, …, v_{C}} represent two different partitions of S, define the following:

a, the number of cell pairs which are assigned to the same class in both U and V

b, the number of cell pairs which are assigned to different classes by both U and V
The Rand index was calculated by (13)$RI=\frac{a+b}{\left(\genfrac{}{}{0.0pt}{}{n}{2}\right)}$
In addition, the ARI considered adjusting measures of clustering accuracy for chance. The ARI was defined using the following formula: (14)$ARI=\frac{\sum _{ij}\left(\genfrac{}{}{0.0pt}{}{{n}_{ij}}{2}\right)\sum _{i}\left(\genfrac{}{}{0.0pt}{}{{n}_{i\cdot}}{2}\right)\sum _{j}\left(\genfrac{}{}{0.0pt}{}{{n}_{\cdot j}}{2}\right)/\left(\genfrac{}{}{0.0pt}{}{n}{2}\right)}{\left(\genfrac{}{}{0.0pt}{}{n}{2}\right)\sum _{i}\left(\genfrac{}{}{0.0pt}{}{{n}_{i\cdot}}{2}\right)\sum _{j}\left(\genfrac{}{}{0.0pt}{}{{n}_{\cdot j}}{2}\right)/\left(\genfrac{}{}{0.0pt}{}{n}{2}\right)}$ where, n_{ij} = U_{i}∩V_{j}, n_{i⋅} = ∑_{j}n_{ij} and n_{⋅j} = ∑_{i}n_{ij}.
Gene analysis
Our proposed method was used not only for cell clustering but also for gene analysis using the extracted factors. The factors in the coefficient matrix that showed higher values in specific clusters than those in other clusters indicated latent factors of the cluster. Therefore, the same factors in the basis matrices also showed latent factors of the cluster, and the genes that showed higher values in the basis matrices reflected characteristic features of the cluster, in other words, marker genes.
Results
To assess the accuracy of our method, we performed cell clustering using five scRNAseq datasets and one bulk RNAseq dataset. In this clustering, we estimated the optimal parameters (the ranks in matrix factorization) using the tradeoff relationship between sparseness and loss. We also evaluated the effect of the combinations of quantification methods and the sample size for the clustering accuracy. Additionally, we analyzed more details of the factorized matrices by showing their relevance to marker genes.
Dataset
We used six different datasets including RNAseq library and quantified gene expression levels measured using each method in previous studies (Table 1). Only Monaco dataset is a bulk RNAseq, the others are scRNAseq libraries. Because Monaco dataset is a set of sorted cell types, it is suitable for the evaluation of clustering accuracy. CellBench is a set of datasets designed to benchmark various singlecell data analysis. In this study, we used the dataset named “sc_10x_5cl”, single cells from the mixture of five cell lines. In advance, we removed any cell types that were unclear in previous studies.
Dataset (citaion)  Tissue/process  Quantification  Reference  The number of cells  The number of classes 

Treutlein (Treutlein et al., 2014)  Lung epithelium  TopHat v2.0.8, Cufflinks v2.0.2  mm10  80  5 
Pollen (Pollen et al., 2014)  Brain  TopHat v2.0.4, RSEM v1.2.4  hg19  259  10 
Segerstolpe (Segerstolpe et al., 2016)  Pancrease  STAR v2.3.0e, rpkmforgenes  hg19  2166  12 
Xin (Xin et al., 2016)  Pancrease  CLC Bio Genomics Workbench v7.0  GRCh37  1492  4 
Monaco (Monaco et al., 2019)  Immune cells  Kallisto v0.43.1, Tximport v1.6.0  GENCODE Human 26  114  10 
CellBench (sc_10x_5cl) (Tian et al., 2019)  Lung adenocarcinoma cell lines  scPipe v.1.3.0  GRCh38  3918  5 
To perform our method SCJNMF and compare the accuracy of the clustering methods, We alternatively quantified gene expression values in each cell from the RNAseq library using Salmon(v1.0.0) with GENCODE references and annotations (Mouse Release 21/ Human Release 32). Only CellBench dataset was quantified by Using Salmon with Alevin (Srivastava et al., 2019), kallisto with BUStools and STAR with Cell Ranger to process barcodes in short reads.
In addition, for the first 4 datasets, we also prepared other gene expression profiles using kallisto(v0.46.2), STAR(v2.7.3), to evaluate our method for the combinations of quantification methods and the sample size.
Settings of clustering methods
We compared the accuracy of cell clustering using our proposed method to that obtained using other major unsupervised clustering methods, including LSNMF (Lin, 2007; Zitnik & Zupan, 2012), SC3 (Kiselev et al., 2017), and Seurat (Satija et al., 2015).
For our proposed method, each run consisted of ten runs and extracted an ARI score of the top combined rank in terms of sparseness and loss (the top ranking of sparseness is maximum and the top ranking of loss is minimum). The rank was determined according to the results of our experiment described in the Supplemental Information and Fig. S1 as follows: Treutlein, k = 5; Pollen, k = 8; Segerstolpe, k = 25; Xin, k = 16; Monaco, k = 8; CellBench, k = 5. In addition, we determined the regularization parameter λ_{1} = geneset1/geneset2. Hierarchical clustering could not determine the number of clusters; therefore, we set the same number reported in previous studies in advance.
In the LSNMF method, similar to our jointNMF, classification by hierarchical clustering was performed using a matrix of factors of the cells, and the rank and the number of clusters were set to the same values as ours. SC3 method was performed with the default parameters. For LSNMF and SC3, each run was performed ten times, considering random initial values. For Seurat, we plotted only the highest ARI score in the runs as the resolution parameter in the FindClusters function was increased from 0 to 1 in 0.1 steps. The parameters of the FindNeighbors function were set to the default values (dims = 1:10, k.param = 20).
Accuracy of cell clustering
Figure 2 shows the ARI score of the original (quantified previously) and alternative (quantified using Salmon) data for each clustering method. We ran a twoway experiment using λ_{2} = 1 and λ_{2} = 10.
In the Pollen, Xin, and CellBench datasets, our proposed method performed accurate cell clustering. In contrast, in the Treutlein, Segerstolpe, and Monaco datasets, the ARI score of the proposed method was higher than that of LSNMF, but it was not the highest score. Additionally, the resulting clusters of our method were stable compared to the conventional NMF method (except for the CellBench dataset).
Comparison of quantification methods in SCJNMF
SCJNMF performed cell clustering using one or two gene expression profiles. In this study, we compared the combinations of quantification methods (kallisto, STAR, Salmon) by performing cell clustering and calculating ARI using SCJNMF. To evaluate the impact of “joint” NMF, we also performed factorization and clustering using a single gene expression profile with setting to the parameter λ_{1} = 0. We ran SCJNMF repeatedly for ten trials with random initial values and the parameter λ_{2} = 10.
Figure 3 shows the ARI of the cell clustering using SCJNMF in each quantification method and their combination. As a result, for each dataset, the accuracies of kallisto & STAR and Salmon & STAR combination outperformed using the single quantification method. The accuracy of Salmon & kallisto combination was lower than that of other combinations.
Additionally, we counted the frequency of correlation coefficients for each cell and each gene in each dataset and combination (Fig. 4). In Salmon & kallisto combination, the correlation coefficients were inclined toward a higher value than the other combinations. These results indicated that SCJNMF had a lower accuracy when using similar gene expression profiles (e.g., Salmon & kallisto combination). Therefore, two gene expression profiles with different properties, having a lower correlation of cells and genes, were suitable for SCJNMF.
Size effect for SCJNMF and NMF
To further understand the benefit of NMFbased methods, the effect of sample size (the number of cells in the RNAseq library) against clustering accuracy was evaluated. We randomly subsampled the 25%, 50%, and 75% cells from gene expression profiles, and ran SCJNMF (λ_{2} = 10) and LSNMF. Each trial was performed ten times.
Figure 5 shows the ARI of the cell clustering using LSNMF and SCJNMF for each dataset and subsample rate. As a result, for each dataset, the accuracies of both LSNMF and SCJNMF were increased depends on the subsample rate. This result indicated that the methods based on NMF benefit more from a dataset including a larger number of cells.
Gene analysis using factorized matrix
We found marker genes of the Xin dataset using factorized matrices. The Xin dataset contained data on 1492 single cells and four classes (alpha, beta, delta, and PP) in the pancreas. We showed the results in Fig. 6. The factor of the coefficient matrix showed some characteristic patterns in each cell cluster (e.g., Factor 2 and Factor 5 showed higher values in delta and PP cells) (Fig. 6A). Next, we calculated the correlation of factors between the basis matrices in the common genes (Fig. 6B). The factors in the basis matrices showed a similar tendency. We also showed the loadings of marked factors for delta cells in the coefficient matrix in Fig. 6C. Almost all genes showed similar factor loadings between base1 and base2; however, we confirmed that some of the genes only observed in either gene expression profile also had high values. We also showed the marker genes of delta cells detected using Scanpy Wolf, Angerer & Theis (2018) in the scatter plot. The factor loadings of these genes tended to be higher than those of others, regardless of whether the gene was observed in both gene expression profiles or not. All factors of Xin dataset were shown in Fig. S2.
Discussion
Thus far, we have shown the possibility that our unsupervised clustering method had high accuracy when using the differences in gene expression quantification methods compared with previous studies. However, the proposed method showed worse accuracy than SC3 in the original Treutlein dataset, as well as Seurat in the original Segerstolpe dataset, and showed differences in the accuracy due to the differences in regularization parameters in the Pollen dataset. Treutlein and Pollen datasets have fewer cells than the others, which was one of the most important features of our experiments. Dataset size is an important factor (common to machine learning approaches) for high accuracy, and this characteristic is also applied to our method.
We showed the effective combinations of quantification methods by comparing the accuracy of cell clustering for each combination. In particular, we suggested that the combination of gene expression profiles that have similar properties (e.g., Salmon & kallisto) had lower accuracy in SCJNMF. Compared to the combination, including STAR that maps RNAseq reads to a reference genome, Salmon and kallisto are similar methods in the quantification algorithm, as these are alignmentfree quantification methods. Therefore, these gene expression profiles have similar properties. In SCJNMF using similar gene expression profiles, it is difficult to separate common factors derived from cells (H) and factors derived from genes (W_{1}, W_{2}). Meanwhile, although it has been reported that the pseudoalignment method loses many reads, leading to a lower mean expression (Vieth et al., 2019), it is possible to improve the clustering performance by incorporating them into our method with the other quantification method.
We presented more details about the characteristics of the factorized matrix and the relationships of marker genes. The coefficient matrix showed characteristic factors in each cell cluster, and both basis matrices had similar factors. The marker genes of a cell cluster showed high factor loadings in the basis matrices that characterized a specific cell cluster in the coefficient matrix, regardless of including both gene expression profiles or only one. In other words, factor loadings in basis matrices that characterize a specific cell cluster in the coefficient matrix are related to the marker genes for that cluster. This result suggested that genes showing high factor loadings in the basis matrices probably had some important features in the cluster with high factor loadings in the coefficient matrix. In particular, we should pay particular attention to those genes observed only in either gene expression profile because they are not considered in conventional methods.
In summary, our method is effective in the following cases.

The number of cells is sufficient large.

Different quantification methods yield gene expression profiles with different characteristics.
Meanwhile, there are some limitations. As shown in the result of CellBench dataset, the number of cells is quite large and the nature of the cell type is well defined, jointless NMF gives sufficiently good results. In such a case, our jointNMF may deteriorate the stability of the solution. Moreover, we also observed cases that the jointNMF worsened the accuracy (“salmon & kallisto” combination in Pollen dataset, Fig. 3), although the effect of the “joint” was either better or unchanged in many cases. It should be avoided to input expression profiles quantified by similar approaches.
Conclusion
We proposed SCJNMF, which performs cell clustering using common factors extracted from multiple gene expression profiles quantified using different methods. As a result, it is possible to perform robust analysis compared with the case in which only a single quantification method is used because it is unnecessary to consider the differences in gene expression profiles. The accuracy (ARI) of cell clustering obtained using our method was higher than that of other major clustering methods. Additionally, we showed that the combination of different quantification methods increases the accuracy of cell clustering compared to that of a similar quantification. Moreover, we showed the details of the extracted factors. The genes characteristic to specific cell groups (marker genes) showed remarkable factor loadings in terms of the factorized matrices; in other words, these results suggest a potential for identifying important genes in the dataset. Some genes may not be counted depending on the quantification methods used; they can be detected using multiple gene expression profiles generated using different quantification methods and SCJNMF if they are potential markers.
Supplemental Information
Loss and sparseness of the matrix H in each rank
We performed the experiment in two ways with different regularization parameters λ_{2} = 1 and λ_{2} = 10.
Scatterplot with density for all factors decomposed by using SCJNMF (Xin dataset)
Points show the genes and their color indicated the density (the number of neighbors).