## Estimating the frequency of multiplets in single-cell RNA sequencing from cell-mixing experiments
1. February 26, 2019: Minor Correction: The article contains a typo in Equations (2) and (3) that was helpfully identified by Lior Pachter. Specifically, both of these equations contained a single occurrence of exp(-mu1 + mu2) that should instead be exp(-mu1 - mu2). All subsequent equations in the paper as well as the code were correct.

## Introduction

Many methods for single-cell RNA sequencing involve partitioning cells into barcoded droplets (Klein et al., 2015; Macosko et al., 2015; Zheng et al., 2017), wells (Gierahn et al., 2017), or combinations of wells (Cao et al., 2017). As long as the number of possible partitions exceeds the number of cells, most partitions will contain at most one cell. However, some fraction of the non-empty partitions will contain multiple cells, and estimating this multiplet frequency is an important aspect of experimental quality control.

The most common method to determine the multiplet frequency is to mix two types of cells (e.g., human and mouse). During the analysis of the sequencing results, each non-empty partition can be identified as containing transcripts from one or both of the two cell types. Partitions that contain a substantial number of transcripts from both cell types must be multiplets. If the two cell types are mixed equally and the average number of cells per partition is low (so that most multiplets are doublets), then the multiplet frequency can be estimated as simply twice the fraction of non-empty partitions that contain a mix of cell types. The logic is that all the multiplets are doublets, and only half the doublets will have cells of both types (the others will have two cells of the same type). This approach has been used to estimate the multiplet frequency during the prototyping of most single-cell RNA sequencing methods (Klein et al., 2015; Macosko et al., 2015; Zheng et al., 2017; Gierahn et al., 2017; Cao et al., 2017).

However, in some cases the two cell types may be mixed in unequal proportions. Unequal mixing could arise simply from error during cell counting, or it could be an intentional aspect of experimental design (Rosenberg et al., 2018). For instance, if the researcher is actually interested in the human cells and simply wants to include an internal control to estimate the multiplet frequency during each new experiment, then (s)he may want to add fewer mouse cells so that most of the resulting data is for the human cells. In addition, when analyzing naturally occurring mixtures of cells of multiple types, the different cell types will usually be present in unequal proportions. But when the cells are mixed unequally, it is no longer valid to estimate the multiplet frequency as simply twice the fraction of non-empty partitions that contain a mix of both cell types. Surprisingly, I could find no published descriptions of how to calculate the multiplet frequency from unequal mixes of two cell types. Here, I remedy this gap in the literature by deriving the equations to compute the multiplet frequency when the cells are mixed in arbitrary proportions under the assumption that the number of cells per partition is Poisson distributed. This Poisson assumption is accurate when cells are loaded randomly and independently into partitions.

## Methods

The LaTeX source for this paper, the Jupyter notebooks that implement the calculations, and all materials associated with the writing and review of the paper are publicly available in a GitHub repository at https://github.com/jbloomlab/multiplet_freq. The Jupyter notebooks are also available in Files S1 and S3, and HTML renderings of the notebooks are in Files S2 and S4.

## Results

### Derivation of multiplet frequency from observed numbers of pure and mixed-cell droplets

Consider the case in which cells of two types (e.g., human and mouse) are distributed into individual barcoded droplets, although the same logic applies if the cells are distributed into barcoded wells or combinations of wells. Assume the sequencing data have been analyzed so that each non-empty droplet can be classified as containing at least one cell of type 1, at least one cell of type 2, or cells of both types. I will refer to the number of droplets in each of these three groupings as N1, N2, and N1,2, respectively. For instance, the 10× cellranger pipeline (version 2.1.1) returns these numbers as the “Estimated Number of Cell Partitions.”

The only assumption of the derivation is that the number of cells per droplet is Poisson distributed. Let μ1 be the average number of cells of type 1 per droplet, and μ2 be the average number of cells of type 2 per droplet. The average number of cells of any type per droplet is then μ1 + μ2. Therefore, the probability that a droplet contains at least one cell of any type is

$$\begin{aligned}\mathrm{Pr}(c \ge 1) &= 1 - \mathrm{Pr}(c = 0) \\ &= 1 - e^{-\mu_1 - \mu_2}.\end{aligned} \tag{1}$$

Likewise, the probability that a droplet contains multiple cells of any type (i.e., is a multiplet) is

$$\begin{aligned}\mathrm{Pr}(c \ge 2) &= 1 - \mathrm{Pr}(c = 0) - \mathrm{Pr}(c = 1) \\ &= 1 - e^{-\mu_1 - \mu_2} - (\mu_1 + \mu_2)\, e^{-\mu_1 - \mu_2}.\end{aligned} \tag{2}$$

The multiplet frequency M is simply the probability that a droplet with at least one cell actually contains multiple cells, which is

$$\begin{aligned}M &= \frac{\mathrm{Pr}(c \ge 2)}{\mathrm{Pr}(c \ge 1)} \\ &= 1 - \frac{(\mu_1 + \mu_2)\, e^{-\mu_1 - \mu_2}}{1 - e^{-\mu_1 - \mu_2}}.\end{aligned} \tag{3}$$

However, evaluating this expression for M requires the values of μ1 and μ2.
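When μ1 and μ2 are known, Eq. (3) can be evaluated directly. The following is a minimal sketch rather than the code from the paper's notebooks; the function name `multiplet_frequency_from_means` is a hypothetical one chosen for illustration, and the exponent uses the corrected sign, exp(-μ1 - μ2).

```python
import math


def multiplet_frequency_from_means(mu1: float, mu2: float) -> float:
    """Multiplet frequency M = Pr(c >= 2) / Pr(c >= 1) under Poisson loading.

    mu1 and mu2 are the mean numbers of type-1 and type-2 cells per droplet.
    (Hypothetical helper name, not taken from the paper's notebooks.)
    """
    mu = mu1 + mu2
    if mu <= 0:
        raise ValueError("mu1 + mu2 must be positive")
    # Pr(c >= 1) = 1 - exp(-mu); -expm1(-mu) avoids cancellation for small mu.
    pr_nonempty = -math.expm1(-mu)
    # Pr(c = 1) = mu * exp(-mu)
    pr_singlet = mu * math.exp(-mu)
    return 1.0 - pr_singlet / pr_nonempty


# Illustrative loading: 0.05 cells of each type per droplet on average.
m_low = multiplet_frequency_from_means(0.05, 0.05)
print(f"M = {m_low:.4f}")
```

For small μ1 + μ2 the result is close to (μ1 + μ2)/2, which matches the intuition that nearly all multiplets are doublets at low loading.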

We can write down equations for μ1 and μ2 by again using the fact that the number of cells per droplet is Poisson distributed. Specifically, if N is the total number of droplets (empty and non-empty), then the expected number of droplets that have at least one cell of type 1 is $N \times \mathrm{Pr}(c_1 \ge 1) = N\left(1 - e^{-\mu_1}\right)$. The observed number of droplets with at least one cell of type 1 is N1, so setting the observed number equal to the expected number gives us an equation for μ1,

$$N_1 = N\left(1 - e^{-\mu_1}\right). \tag{4}$$

This equation is easily solved for μ1 to yield

$$\mu_1 = -\ln\left(\frac{N - N_1}{N}\right), \tag{5}$$

and likewise for μ2,

$$\mu_2 = -\ln\left(\frac{N - N_2}{N}\right). \tag{6}$$
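As a small illustration of Eqs. (5) and (6), the sketch below solves for the mean number of cells per droplet when the total droplet count N is simply assumed to be known. The function name and the counts are hypothetical.

```python
import math


def mean_cells_per_droplet(n_with_type: int, n_total: int) -> float:
    """Solve N_i = N * (1 - exp(-mu_i)) for mu_i (Eqs. (5) and (6))."""
    if not 0 <= n_with_type < n_total:
        raise ValueError("need 0 <= n_with_type < n_total")
    # -ln((N - N_i) / N) = -ln(1 - N_i / N); log1p stays accurate when N_i / N is small.
    return -math.log1p(-n_with_type / n_total)


# Hypothetical numbers: 1,000 of 10,000 droplets contain at least one type-1 cell.
mu1 = mean_cells_per_droplet(1_000, 10_000)
print(f"mu1 = {mu1:.4f}")
```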

Equations (5) and (6) give us a way to determine the values (μ1 and μ2) needed to calculate the multiplet frequency (Eq. (3)) in terms of the experimental observables N1 and N2. Unfortunately, these two equations also require knowledge of the total (empty and non-empty) number of droplets N, which is not directly observable from the sequencing data.

However, we can take advantage of another relationship to calculate N. The fraction of all (empty and non-empty) droplets that contain cells of both types is $\frac{N_{1,2}}{N}$. Because cells of the two types are loaded into droplets independently, this fraction is simply the product of the probability that a droplet contains at least one cell of type 1 with the probability that a droplet contains at least one cell of type 2, which in mathematical terms can be stated as $\mathrm{Pr}(c_1 \ge 1 \wedge c_2 \ge 1) = \mathrm{Pr}(c_1 \ge 1) \times \mathrm{Pr}(c_2 \ge 1)$. Therefore,

$$\frac{N_{1,2}}{N} = \frac{N_1}{N} \times \frac{N_2}{N}. \tag{7}$$

This equation can be solved to give

$$N = \frac{N_1 N_2}{N_{1,2}}, \tag{8}$$

which can be completely evaluated in terms of the experimental observables. Equations (5), (6), and (8) can be used to calculate μ1 and μ2 in terms of the experimental observables, and those results used to calculate the multiplet frequency via Eq. (3). This provides an analytic solution for the multiplet frequency in terms of the three experimental observables.
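The full chain from observables to multiplet frequency can be sketched in a few lines of Python. This is an illustrative re-implementation under the Poisson assumption, not the function from the paper's notebooks; the name `multiplet_frequency` and the example counts are hypothetical.

```python
import math


def multiplet_frequency(n1: int, n2: int, n12: int) -> float:
    """Multiplet frequency from the three observed droplet counts.

    n1:  droplets with at least one type-1 cell (mixed droplets included)
    n2:  droplets with at least one type-2 cell (mixed droplets included)
    n12: droplets with cells of both types
    """
    if not 0 < n12 < min(n1, n2):
        raise ValueError("need 0 < n12 < min(n1, n2)")
    n_total = n1 * n2 / n12                # Eq. (8): total droplets, empty or not
    mu1 = -math.log1p(-n1 / n_total)       # Eq. (5)
    mu2 = -math.log1p(-n2 / n_total)       # Eq. (6)
    mu = mu1 + mu2
    pr_nonempty = -math.expm1(-mu)         # Pr(c >= 1), Eq. (1)
    return 1.0 - mu * math.exp(-mu) / pr_nonempty   # Eq. (3)


# Hypothetical counts for an equal mix.
m = multiplet_frequency(1_000, 1_000, 100)
print(f"M = {m:.4f}")
```

With these counts Eq. (8) gives N = 10,000, and the implied number of non-empty droplets, N(1 - exp(-μ1 - μ2)), equals N1 + N2 - N1,2 = 1,900, which is a useful internal consistency check.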

### Implementation and example calculations

A simple function to perform the calculations described in the previous subsection is implemented in Python in the Jupyter notebook found at https://github.com/jbloomlab/multiplet_freq/blob/master/calcmultiplet.ipynb, and in R in the Jupyter notebook found at https://github.com/jbloomlab/multiplet_freq/blob/master/calcmultiplet_R.ipynb (see also Files S1–S4). To illustrate the calculations, I used this function to calculate the multiplet frequency for hypothetical data.

First, consider hypothetical data in which the two types of cells are mixed in equal proportions. Prior papers have approximated the multiplet frequency from such experiments as simply twice the fraction of non-empty droplets that contain cells of both types (Klein et al., 2015; Macosko et al., 2015; Zheng et al., 2017; Cao et al., 2017). In the notation defined in the previous subsection, that fraction is $\frac{N_{1,2}}{N_1 + N_2 - N_{1,2}}$, so the approximation is $\frac{2 N_{1,2}}{N_1 + N_2 - N_{1,2}}$. Table 1 shows that the exact equation derived in the previous subsection gives very similar results to this approximate method as long as the multiplet frequency is low. When the multiplet frequency becomes high, the approximate method starts to overestimate the true multiplet frequency, since it fails to account for the fact that some multiplets will contain more than two cells.
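The comparison between the exact and approximate estimates can be reproduced with a short sketch. The counts below are hypothetical values chosen for illustration and are not the entries of Table 1; the function names are likewise assumptions.

```python
import math


def exact_multiplet_frequency(n1, n2, n12):
    """Poisson-based estimate using Eqs. (3), (5), (6), and (8)."""
    n_total = n1 * n2 / n12
    mu = -math.log1p(-n1 / n_total) - math.log1p(-n2 / n_total)
    return 1.0 - mu * math.exp(-mu) / (1.0 - math.exp(-mu))


def approximate_multiplet_frequency(n1, n2, n12):
    """Twice the fraction of non-empty droplets that contain both cell types."""
    return 2.0 * n12 / (n1 + n2 - n12)


# Hypothetical equal-mix counts (N1, N2, N1,2): one low and one high loading.
examples = [(1_000, 1_000, 100), (4_000, 4_000, 1_600)]
results = []
for n1, n2, n12 in examples:
    exact = exact_multiplet_frequency(n1, n2, n12)
    approx = approximate_multiplet_frequency(n1, n2, n12)
    results.append((exact, approx))
    print(f"N1={n1} N2={n2} N1,2={n12}: exact={exact:.3f} approx={approx:.3f}")
```

For these illustrative counts the two estimates are about 0.102 and 0.105 at the lower loading but about 0.425 and 0.500 at the higher one, so the gap widens as multiplets become common.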

Next, consider hypothetical data in which the two types of cells are mixed in unequal proportions. Table 2 shows the multiplet frequencies for several such experiments. An interesting aspect of the results is that at high multiplet frequencies and very unequal cell proportions, the multiplet frequency is substantially lower than the fraction of droplets containing the rarer cell type that contain a mix of both cell types. The reason is that multiplets (particularly higher-order ones) become more and more likely to contain at least one cell of the rarer type relative to droplets that contain only one cell. For instance, in the final experiment in Table 2, two-thirds of the droplets containing mouse cells have a mix of both cell types, yet less than half the non-empty droplets are multiplets (the multiplet frequency is 0.459). This somewhat non-intuitive result illustrates the importance of using the correct mathematical relationship to calculate the multiplet frequency when cell types are mixed unequally.
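A sketch with made-up counts shows the same qualitative effect. The numbers below are hypothetical and are not the entries of Table 2; they were chosen so that two-thirds of the droplets containing the rarer (type 2) cells are mixed.

```python
import math


def multiplet_frequency(n1, n2, n12):
    """Poisson-based estimate using Eqs. (3), (5), (6), and (8)."""
    n_total = n1 * n2 / n12
    mu = -math.log1p(-n1 / n_total) - math.log1p(-n2 / n_total)
    return 1.0 - mu * math.exp(-mu) / (1.0 - math.exp(-mu))


# Hypothetical unequal mix: type 1 is abundant, type 2 is rare.
n1, n2, n12 = 6_000, 300, 200

mixed_among_rare = n12 / n2                 # fraction of type-2 droplets that are mixed
naive = 2.0 * n12 / (n1 + n2 - n12)         # equal-mix rule, invalid for this design
exact = multiplet_frequency(n1, n2, n12)

print(f"mixed fraction among type-2 droplets: {mixed_among_rare:.3f}")
print(f"naive 'twice the mixed fraction':     {naive:.3f}")
print(f"Poisson-based multiplet frequency:    {exact:.3f}")
```

Under Eq. (7), the mixed fraction among type-2 droplets equals N1/N, the probability that a droplet holds at least one type-1 cell, so it tracks the loading of the abundant type rather than the multiplet frequency itself. The equal-mix rule errs in the opposite direction here and gives a value far below the Poisson-based estimate.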

## Conclusions

I have described how to calculate the multiplet frequency in single-cell RNA sequencing experiments in which two cell types are mixed in arbitrary proportions. It is important to note that this calculation requires that the sequencing data have already been analyzed to determine whether each partition contains a non-negligible number of transcripts from each cell type, but many common analysis programs (such as the 10× cellranger pipeline) already do this.

The calculation also assumes that the number of cells per droplet follows a Poisson distribution. While many single-cell RNA sequencing methods are designed to partition cells in a way that is consistent with this assumption (Klein et al., 2015; Macosko et al., 2015; Zheng et al., 2017; Gierahn et al., 2017; Cao et al., 2017), it is possible that cell clumping or other factors could bias certain partitions to contain more cells than expected under a Poisson distribution. In such a scenario, the calculations in this paper would overestimate the true multiplet frequency if the clumping is equally likely across cell types, but could underestimate the true multiplet frequency if intra-cell-type clumping is more likely than inter-cell-type clumping.
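One way to check how the estimate behaves when the loading assumption holds is a small simulation. The sketch below assigns each cell to a uniformly random droplet (random, independent loading with no clumping), counts the observables, and compares the Poisson-based estimate with the multiplet frequency actually realized in the simulation. All names, counts, and the seed are hypothetical choices for illustration.

```python
import math
import random


def multiplet_frequency(n1, n2, n12):
    """Poisson-based estimate using Eqs. (3), (5), (6), and (8)."""
    n_total = n1 * n2 / n12
    mu = -math.log1p(-n1 / n_total) - math.log1p(-n2 / n_total)
    return 1.0 - mu * math.exp(-mu) / (1.0 - math.exp(-mu))


def simulate_loading(n_droplets, n_cells_1, n_cells_2, seed=0):
    """Drop every cell into a uniformly random droplet; return per-droplet counts."""
    rng = random.Random(seed)
    c1 = [0] * n_droplets
    c2 = [0] * n_droplets
    for _ in range(n_cells_1):
        c1[rng.randrange(n_droplets)] += 1
    for _ in range(n_cells_2):
        c2[rng.randrange(n_droplets)] += 1
    return c1, c2


# Hypothetical experiment: 50,000 droplets, 25,000 type-1 and 5,000 type-2 cells.
c1, c2 = simulate_loading(50_000, 25_000, 5_000)

# Observables that a sequencing pipeline would report.
n1 = sum(a > 0 for a in c1)
n2 = sum(b > 0 for b in c2)
n12 = sum(a > 0 and b > 0 for a, b in zip(c1, c2))

# Ground truth, available only because this is a simulation.
nonempty = sum(a + b >= 1 for a, b in zip(c1, c2))
multiplets = sum(a + b >= 2 for a, b in zip(c1, c2))
true_m = multiplets / nonempty

est_m = multiplet_frequency(n1, n2, n12)
print(f"simulated multiplet frequency: {true_m:.3f}")
print(f"estimate from N1, N2, N1,2:    {est_m:.3f}")
```

The simulation could be extended with a clumping step to explore how departures from independent loading shift the estimate, although that extension is not attempted here.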

Finally, the approach in this paper only calculates the multiplet frequency; it does not actually identify the multiplets so that they can be removed from downstream analyses. For that purpose, other more sophisticated approaches have been developed (Ilicic et al., 2016; Stoeckius et al., 2017; Kang et al., 2018; Wolock, Lopez & Klein, 2018; DePasquale et al., 2018). Nonetheless, simply calculating the multiplet frequency from the data returned by standard pipelines such as the 10× cellranger pipeline is important for many purposes, and the results here enable that to be done regardless of the proportions at which the cell types are mixed.

## Supplemental Information

### Supplemental file 1.

A Jupyter notebook that implements the calculations in Python, and does the calculations for the examples shown in the tables in this paper.

### Supplemental file 2.

This file contains an HTML rendering of the Jupyter notebook in Supplemental file 1.

### Supplemental file 3.

A Jupyter notebook that implements the calculations in R, and does the calculations for the examples shown in the tables in this paper.

### Supplemental file 4.

This file contains an HTML rendering of the Jupyter notebook in Supplemental file 3.

### Competing Interests

The author declares that he has no competing interests.

### Author Contributions

Jesse D. Bloom conceived and designed the experiments, analyzed the data, contributed reagents/materials/analysis tools, prepared figures and/or tables, authored or reviewed drafts of the paper, approved the final draft.

### Data Availability

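The following information was supplied regarding data availability:

The code and Jupyter notebooks are available in the GitHub repository at https://github.com/jbloomlab/multiplet_freq.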

### Funding

This work was supported by grants R01 GM102198 and R01 AI127893 from the National Institutes of Health. The work of the author is also supported in part by a Faculty Scholars grant from HHMI and the Simons Foundation. There was no additional external funding received for this study. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.