Background: A major goal of RNA-Seq data analysis is to reconstruct the full set of gene transcripts expressed in a biological sample and to quantify their expression levels. The process typically involves multiple steps, including mapping short sequence reads to a reference genome and estimating expression levels from these mappings. Multiple algorithms and approaches exist for each processing step, and the impact of different methods on the estimation of gene expression is not entirely clear.
Methods: We evaluated the impact of three common mapping algorithms on differential expression analysis in an RNA-Seq dataset describing the lung response to acute neonatal hyperoxia. RNA-Seq data generated on the Illumina platform were aligned to the mouse genome using CASAVA, TopHat, and SHRiMP. Significance Analysis of Microarrays and Cuffdiff were used to identify genes differentially expressed between hyperoxia-challenged and age-matched control mice.
Results: A total of 1403 genes were detected as differentially expressed by at least one combination of mapping and gene selection method. A majority of genes (>65%) were identified by all three mapping methods, regardless of the gene selection approach. Expression patterns for 52 genes were examined by quantitative polymerase chain reaction (qPCR). Importantly, validation rates differed among genes selected by each mapping method: 72% for CASAVA, 69% for TopHat, and 63% for SHRiMP. Surprisingly, the validation rate for genes selected by all three mapping methods was no greater than that of the best single method.
Conclusion: The choice of mapping strategy impacts the reliability of gene selection for RNA-Seq data analysis.