pcr: an R package for quality assessment, analysis and testing of qPCR data
- Published
- Accepted
- Received
- Academic Editor
- Sahil Seth
- Subject Areas
- Bioinformatics, Cell Biology, Computational Biology, Molecular Biology
- Keywords
- qPCR, R package, Data analysis, Quality assessment
- Copyright
- © 2018 Ahmed and Kim
- Licence
- This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
- Cite this article
- 2018. pcr: an R package for quality assessment, analysis and testing of qPCR data. PeerJ 6:e4473 https://doi.org/10.7717/peerj.4473
Abstract
Background
Real-time quantitative PCR (qPCR) is a broadly used technique in the biomedical research. Currently, few different analysis models are used to determine the quality of data and to quantify the mRNA level across the experimental conditions.
Methods
We developed an R package to implement methods for quality assessment, analysis and testing qPCR data for statistical significance. Double Delta C_{T} and standard curve models were implemented to quantify the relative expression of target genes from C_{T} in standard qPCR control-group experiments. In addition, calculation of amplification efficiency and curves from serial dilution qPCR experiments are used to assess the quality of the data. Finally, two-group testing and linear models were used to test for significance of the difference in expression control groups and conditions of interest.
Results
Using two datasets from qPCR experiments, we applied different quality assessment, analysis and statistical testing in the pcr package and compared the results to the original published articles. The final relative expression values from the different models, as well as the intermediary outputs, were checked against the expected results in the original papers and were found to be accurate and reliable.
Conclusion
The pcr package provides an intuitive and unified interface for its main functions to allow biologist to perform all necessary steps of qPCR analysis and produce graphs in a uniform way.
Introduction
Real-time quantitative PCR (qPCR) is a commonly used technique to analyze the relative gene expression level in the biomedical research. In most cases, small scale experiments are designed to quantify the level of mRNA among experimental conditions. Some advanced machines and optimized protocols have simplified the experiments to a highly efficient one-step process, allowing the effective analysis of a large scale of qPCR data. However, all processes for assessing the quality of the data, performing the analysis and reporting the results are not done in the most uniform way across the literature (Bustin & Nolan, 2004). Different analysis models have been proposed and implemented in different software environments (Pabinger et al., 2014). Furthermore, requirements and guidelines for reporting qPCR data were independently introduced (Bustin et al., 2009).
In this report, we introduce an open source R package for performing quality assessment, modeling and testing for statistical significance of qPCR data in a uniform way. In its current version, the pcr package implement two methods for relative quantification of mRNA expression proposed originally by Livak & Schmittgen (2001), in addition to the necessary steps to check the assumption of these methods. Also, we implement a number of methods to check for statistical significance in qPCR data which were introduced in SAS by Yuan et al. (2006). Finally, the package provides unified interface to make the analysis accessible and the ability to make graphs of the different analysis steps for visual inspection and preparation of publication-level figures. We start by describing the process for generating the data in the original papers, briefly introduce the methods and apply them to the original data using the pcr.
Materials & Methods
Data sources
To illustrate the usage of the pcr package and to apply it to qPCR data, we used real qPCR datasets from two published papers. In addition, we compared the results obtained by the pcr package to that of the original paper to ensure the reliability. In their original article, Livak & Schmittgen (2001) collected total RNA from human tissues; brain and kidney. c-myc and GAPDH primers were then used for cDNA synthesis and used as input in the PCR reaction. Seven different dilutions where used as input to the PCR reaction (three replicates each), this dataset was referred to as ct3 and shown in Table 1. Six replicates for each tissue were run in separate tubes. This dataset was referred to as ct1 through this document and shown along with the difference calculations in Tables 2 and 3. At the second work, Yuan et al. (2006) extracted total RNA from Arabidopsis thaliana plant treated and control samples (24 samples each), and performed qPCR analyses using MT7 and ubiquitin primers. This dataset was referred to as ct4 and shown the results of the different testing methods that applied in the original paper in Table 4. In all datasets, the raw C_{T} (Cycle Threshold), also known as C_{q}, was recorded (Bustin et al., 2009) and will be used as input to the pcr analysis pipeline.
Input RNA (ng) | c-myc average C_{T} | GAPDH average C_{T} | ΔC_{T} c-myc-GAPDH |
---|---|---|---|
1.0 | 25.59 ± 0.04 | 22.64 ± 0.03 | 2.95 ± 0.05 |
0.5 | 26.77 ± 0.09 | 23.73 ± 0.05 | 3.04 ± 0.10 |
0.2 | 28.14 ± 0.05 | 25.12 ± 0.10 | 3.02 ± 0.11 |
0.1 | 29.18 ± 0.13 | 26.16 ± 0.02 | 3.01 ± 0.13 |
0.05 | 30.14 ± 0.03 | 27.17 ± 0.06 | 2.97 ± 0.07 |
0.02 | 31.44 ± 0.16 | 28.62 ± 0.10 | 2.82 ± 0.19 |
0.02 | 32.42 ± 0.12 | 29.45 ± 0.08 | 2.97 ± 0.14 |
Tissue | c-myc C_{T} | GAPDH C_{T} | ΔC_{T} c-myc-GAPDH | ΔΔC_{T}ΔC_{T} − ΔC_{T,Brain} | c-myc_{N} Rel. to Brain |
---|---|---|---|---|---|
Brain | 30.72 | 23.7 | |||
30.34 | 23.56 | ||||
30.58 | 23.47 | ||||
30.34 | 23.65 | ||||
30.5 | 23.69 | ||||
30.43 | 23.68 | ||||
Average | 30.49 ± 0.15 | 23.63 ± 0.09 | 6.86 ± 0.17 | 0.00 ± 0.17 | 1.0 (0.9–1.1) |
Kidney | 27.06 | 22.76 | |||
27.03 | 22.61 | ||||
27.03 | 22.62 | ||||
27.1 | 22.6 | ||||
26.99 | 22.61 | ||||
26.94 | 24.18 | ||||
Average | 27.03 ± 0.06 | 22.66 ± 0.08 | 4.37 ± 0.10 | −2.50 ± 0.10 | 5.6 (5.3–6.0) |
Tissue | c-myc (ng) | GAPDH (ng) | c-myc_{N} norm. to GAPDH | c-myc_{N} Rel. to Brain |
---|---|---|---|---|
Brain | 0.033 | 0.51 | ||
0.043 | 0.56 | |||
0.036 | 0.59 | |||
0.043 | 0.53 | |||
0.039 | 0.51 | |||
0.040 | 0.52 | |||
Average | 0.039 ± 0.004 | 0.54 ± 0.034 | 0.07 ± 0.008 | 1.0 ± 0.12 |
Kidney | 0.40 | 0.96 | ||
0.41 | 1.06 | |||
0.41 | 1.05 | |||
0.39 | 1.07 | |||
0.42 | 1.06 | |||
0.43 | 0.96 | |||
Average | 0.41 ± 0.016 | 1.02 ± 0.052 | 0.40 ± 0.025 | 5.5 ± 0.35 |
Test | ΔΔC_{T} (estimate) | p-value | Confidence interval |
---|---|---|---|
Multiple regression | −0.6848 | <0.0001 | (−0.4435, −0.9262) |
ANOVA | −0.6848 | <0.0001 | (−0.4435, −0.9262) |
t-test | −0.6848 | <0.0001 | (−0.4147, −0.955) |
Wilcoxon test | −0.6354 | <0.0001 | (−0.4227, −0.8805) |
Statistical methods
In contrast with the absolute quantification of the amount of mRNA in a sample, the relative quantification uses a internal control (reference gene) and/or a control group (reference group) to quantify the mRNA of interest relative to these references. This relative quantification was sufficient to draw conclusions in most of the biomedical applications involving qPCR. A few methods were developed to perform these relative quantification. These methods generally require different assumptions and models for the analysis. The most common two of these methods were described here in the following sections.
The comparative C_{T} methods
The comparative C_{T} methods assume that the cDNA templates of the gene/s of interest as well as the control/reference gene have similar amplification efficiency, and also that this amplification efficiency is near perfect. This means that, at a certain threshold during the linear portion of the PCR reaction, the amount of the gene of the interest and the control double each cycle. Another assumption is that the expression difference between two genes or two samples can be captured by subtracting one (gene or sample of interest) from another (reference). The final assumption is that the reference doesn’t change with the treatment or the course in question. The formal derivation of the double delta C_{T} model is described here Livak & Schmittgen (2001). Briefly, the ΔΔC_{T} is given by: (1)$\Delta \Delta {C}_{T}=\Delta {C}_{T,q}-\Delta {C}_{T,cb}$
And the relative expression by: (2)$2}^{-\Delta \Delta {C}_{T}$ where ΔC_{T,q} is the difference in the C_{T} (or their average) of a gene of interest and a reference gene in a group of interest. ΔC_{T,cb} is the the difference in the C_{T} (or their average) of a gene of interest and a reference gene in a reference group. The error term is given by: (3)$s=\sqrt{{s}_{1}^{2}+{s}_{2}^{2}}$ where s_{1} is the standard deviation of a gene of interest and s_{2} is the standard deviation of a reference gene.
Standard curve
In comparison, this model doesn’t assume perfect amplification but rather actively use the amplification in calculating the relative expression. Therefore, when the amplification efficiency of all genes are 100% both methods should give similar results. The standard curve method is applied using two steps. First, serial dilutions of the mRNAs from the samples of interest are used as input to the PCR reaction. The linear trend of the log input amount and the resulting C_{T} values for each gene are used to calculate an intercept and a slope. Secondly, these intercepts and slopes are used to calculate the amounts of mRNA of the genes of interest and the control/reference in the samples of interest and the control sample/reference. These amounts are finally used to calculate the relative expression in a manner similar to the later method, just using division instead of subtraction. The formal derivation of the model is described here (Yuan et al., 2006). Briefly, the amount of RNA in a sample is given by: (4)$log\phantom{\rule{1em}{0ex}}\mathrm{amount}=\frac{{C}_{T}-b}{m}.$
And the relative expression is given by: (5)$1{0}^{log\phantom{\rule{1em}{0ex}}\mathrm{amount}}$ where C_{T} is the cycle threshold of a gene. b is the intercept of C_{T} log10 input amount. m is the slope of C_{T}. And the error term is given by: (6)$s=\left(cv\right)\left(\stackrel{\u0304}{X}\right)$ where: (7)$cv=\sqrt{c{v}_{1}^{2}+c{v}_{2}^{2}}$ where s is the standard deviation. $\stackrel{\u0304}{X}$ is the average. cv is the coefficient of variation or relative standard deviation.
Statistical significance tests
Assuming that the assumptions of the first methods are holding true, the simple t-test can be used to test the significance of the difference between two conditions (ΔC_{T}). t-test assumes, in addition, that the input C_{T} values are normally distributed and the variance between conditions are comparable. Wilcoxon test can be used if sample size is small, and those two last assumptions are hard to achieve.
Two use the linear regression here. A null hypothesis is formulated as following, (8)${C}_{T,\mathrm{target,treatment}}-{C}_{T,\mathrm{control,treatment}}={C}_{T,\mathrm{target,control}}-{C}_{T,\mathrm{control,control}}.$
This is exactly the ΔΔC_{T} value as explained earlier. So the ΔΔC_{T} is estimated and the null is rejected when ΔΔC_{T} ≠ 0.
Quality assessment
Fortunately, regardless of the method used in the analysis of qPCR data, The quality assessment can be done in a similar way. It requires an experiment similar to that of calculating the standard curve. Serial dilutions of the genes of interest and controls are used as input to the reaction and different calculations are made. The amplification efficiency is approximated be the linear trend between the difference between the C_{T} value of a gene of interest and a control/reference (ΔC_{T}) and the log input amount. This piece of information is required when using the ΔΔC_{T} model. Typically, the slope of the curve should be very small and the R^{2} value should be very close to one. A value of the amplification efficiency itself is given by 10^{−1∕slope} and should be close to 2. Other analysis methods are recommended when this is not the case. Similar curves are required for each gene using the C_{T} value instead of the difference for applying the standard curve method. In this case, a separate slope and intercept for each gene are required for the calculation of the relative expression.
Results & Discussion
Availability & installation
The pcr packages is available on CRAN, the main repository for R packages and can be installed by invoking install.packages in an R (≥3.4.2) session. The package’s source code is also available on github, https://github.com/MahShaaban/pcr along with the development version.
# install the pcr package from CRAN
install.packages('pcr')
The examples shown in this article are explained in greater details in the package vignette that can be accessed through browseVignette(’pcr’). Moreover, the package documentation provides detailed instruction on the input and the output of each function (e.g., ?pcr_analyze).
Functionality & user interface
The pcr package provides different methods for performing quality assessment, modeling and testing real-time qualitative PCR data through the unified interface of three functions: pcr_assess, pcr_analyze and pcr_test, respectively.
Quality assessment
pcr_assess provides two methods for assessing the quality of qPCR data. These are ‘efficiency’ and ‘standard_curve’ to calculate the amplification efficiency and gene standard curves as described in the methods section. The following code block applies both methods to the dataset ct3, shown in Table 1. Using the argument plot as TRUE in the pcr_assess function provides the a graphic presentation of the amplification and the standard curves as shown in Fig. 1.
# load required libraries
library(pcr)
library(ggplot2)
library(cowplot)
library(dplyr)
library(xtable)
library(readr)
# pcr_assess
## locate and read data
fl <− system.file('extdata', 'ct3.csv', package = 'pcr')
ct3 <− read_csv(fl)
## make a vector of RNA amounts
amount <− rep(c(1, .5, .2, .1, .05, .02, .01), each = 3)
## calculate amplification efficiency
res1 <− pcr_assess(ct3,
amount = amount,
reference_gene = 'GAPDH',
method = 'efficiency')
## calculate standard curves
res2 <− pcr_assess(ct3,
amount = amount,
method = 'standard_curve ')
## retain curve information
intercept <− res2$intercept
slope <− res2$slope
Analysis models
Similarly, pcr_analyze provides two methods to model the C_{T} values and calculates the relative expression of target genes. ’delta_delta_ct’ performs the ΔΔC_{T} method described previously. The average relative expression of the target gene in the condition of interest is given by the Eqs. (1) and (2) and the standard deviation by Eq. (3). The calculations are applied to the dataset ‘ct1’, shown in Table 2 and Fig. 2A. ‘relative_curve’ performs the relative standard curve quantification, average relative expression/amount of the target gene in the condition of interest is given by Eqs. (4) and (5) and the standard deviation by Eqs. (6) and (7). The calculation is applied to the same dataset ‘ct1’ and is shown in Table 3 and Fig. 2B.
# pcr_analyze
## locate and read raw ct data
fl <− system.file('extdata', 'ct1.csv', package = 'pcr')
ct1 <− read_csv(fl)
## add grouping variable
group_var <− rep(c('brain', 'kidney'), each = 6)
# calculate all values and errors in one step
## mode == 'separate_tube' default
res1 <− pcr_analyze(ct1,
group_var = group_var,
reference_gene = 'GAPDH',
reference_group = 'brain')
## calculate standard amounts and error
res2 <− pcr_analyze(ct1,
group_var = group_var,
reference_gene = 'GAPDH',
reference_group = 'brain',
intercept = intercept,
slope = slope,
method = 'relative_curve ')
Significance testing
Finally, pcr_test can be used to calculate useful statistics, p-values and confidence intervals on the previous models. Two tests are available of the two-group comparisons; ‘t.test’ and ‘wilcox.test’ to test for the difference of the normalized target gene expression (ΔC_{T}) in one condition to another. Linear regression, ‘lm’, can be applied to estimate these differences between multiple conditions and a reference (Eq. (8)). The following code applies different testing methods to the dataset ‘ct4’. The dataset was published original in (Yuan et al., 2006), along with results of different testing method (Table 4). Table 5 shows the results of the three different tests as implemented in pcr_test.
Gene | Estimate | p_value | Lower | Upper | Term | |
---|---|---|---|---|---|---|
t.test | target | −0.68 | 0.00 | −0.96 | −0.41 | |
wilcox.test | target | −0.64 | 0.00 | −0.88 | −0.42 | |
lm | target | −0.68 | 0.00 | −0.95 | −0.41 | group_vartreatment |
# pcr_test
# locate and read data
fl <− system.file('extdata', 'ct4.csv', package = 'pcr')
ct4 <− read_csv(fl)
# make group variable
group <− rep(c('control', 'treatment'), each = 12)
# test using t −test
tst1 <− pcr_test(ct4,
group_var = group,
reference_gene = 'ref',
reference_group = 'control',
test = 't.test')
# test using wilcox.test
tst2 <− pcr_test(ct4,
group_var = group,
reference_gene = 'ref',
reference_group = 'control',
test = 'wilcox.test')
# testing using lm
tst3 <− pcr_test(ct4,
group_var = group,
reference_gene = 'ref',
reference_group = 'control',
test = 'lm')
The pcr package workflow
In Fig. 3, we suggest a workflow based on three steps; Quality assessment, modeling and testing for statistical significance, these can be applied by using the pcr package functions; pcr_assess, pcr_analyze and pcr_test respectively. The workflow is based on the C_{T} from two kinds of experiments. First, the C_{T} values for the genes of interest and a reference gene in the different experimental groups. Second, the C_{T} values of the same genes from a serial dilution experiment. The amplification efficiency can be calculated by calculating the ΔC_{T} from the first and the slope and the intercept from the second experiment. The ΔΔC_{T} model and the relative expression are then calculated and tested. In the case of the two groups comparison, any of the testing methods can be applied otherwise the linear regression can be used for multiple group comparisons. Alternatively, the standard curve methods can be applied by calculating the amount of cDNA in each sample from the C_{T} values and the individual curves for each gene. These amounts are finally used to obtain the relative expression of the genes of interest.
Comparison with other packages
Pabinger et al. (2014) surveyed the tools used to analyze qPCR data across different platforms. They included nine R packages which provide very useful analysis and visualization methods. Table 6 shows a comparison of the features of these nine packages from the original publication, in addition to the pcr package. Some of these packages focuses one certain model and some are designed to handle high-throughput qPCR data (e.g., chipPCR). Most of these packages are hosted in CRAN and a few on the Bioconductor so they adhere to Bioconductor methods and data containers (e.g., qpcR).
In comparison, pcr provides a unified interface for different quality assessment, analysis and testing models. The different functions take similar inputs to perform different analysis steps which requires minimal formating at the end of the user. The main input and the output are tidy data.frame, and the package source code follows the tidyverse practices. This package targets the small scale qPCR experimental data and the R users. The interface and documentation choices were made with such users in mind and require no deep knowledge in specific data structures or complex statistical models. Users can go from the raw C_{T} values through different analysis steps to publication graphs in a few simple lines of code.
Limitations & future directions
The current version of the pcr package (1.1.0) provides only methods to estimate the expression of genes in a certain condition relative to another. Other methods were proposed for absolute quantification of the copy number of genes in samples (Whelan, Russell & Whelan, 2003). Also, the comparative C_{T} methods assume that the PCR reaction has a close to perfect amplification rates. Other methods were proposed to model the data when this assumption is not true (Liu & Saint, 2002; Tichopad et al., 2003). We are planning to implement methods for absolute quantification and dealing with less than perfect amplification efficiency cases in a future version of the package.
Conclusion
To sum, the pcr package is an open source R package for quality assessing, modeling and testing real-time quantitative PCR data. The package provide an intuitive and unified interface for its main functions to allow biologist to perform all necessary steps of qPCR analysis and produce graphs in a uniform way.