Error estimates for the analysis of differential expression from RNA-seq count data

Conrad Burden; Sumaira Qureshi; Susan R Wilson

doi:10.7287/peerj.preprints.400v1


NOT PEER-REVIEWED

"PeerJ Preprints" is a venue for early communication or feedback before peer review. Data may be preliminary.

A newer version of this Preprint is available: View the latest version


Conrad Burden¹, Sumaira Qureshi¹, Susan R Wilson¹,²

1 Mathematical Sciences Institute, Australian National University, Canberra, Australia

2 School of Mathematics and Statistics, University of New South Wales, Sydney, Australia


Published: 2014-05-30
Accepted: 2014-05-30

Subject Areas: Bioinformatics
Keywords: RNA-seq, differential expression analysis, false discovery rates

Copyright: © 2014 Burden et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ PrePrints) and either DOI or URL of the article must be cited.

Cite this article: Burden C, Qureshi S, Wilson SR. 2014. Error estimates for the analysis of differential expression from RNA-seq count data. PeerJ PrePrints 2:e400v1 https://doi.org/10.7287/peerj.preprints.400v1

Abstract

A number of algorithms exist for analysing RNA-sequencing data to infer profiles of differential gene expression. The problems inherent in building algorithms around statistical models of overdispersed count data are formidable and frequently lead to non-uniform p-value distributions for null-hypothesis data and to inaccurate estimates of false discovery rates (FDRs). This can lead to an inaccurate measure of significance and loss of power to detect differential expression. We use synthetic and real biological data to assess the ability of several available R packages to accurately estimate FDRs. The packages surveyed are based on statistical models of overdispersed Poisson data and include edgeR, DESeq, DESeq2, PoissonSeq and QuasiSeq. Also tested is Polyfit, an add-on package to edgeR and DESeq which we introduce here. Polyfit aims to address the problem of a non-uniform null p-value distribution for two-class datasets by adapting the Storey-Tibshirani procedure. We find that the best performing package, in the sense that it achieves a low FDR which is accurately estimated over the full range of p-values, is the QLSpline implementation of QuasiSeq. This finding holds provided the number of biological replicates in each condition is at least 4. The next best performing packages are edgeR and DESeq. When the number of biological replicates is sufficiently high, and within a range accessible to multiplexed experimental designs, the Polyfit extension improves the performance of edgeR (for ≤ 10 replicates per condition) or DESeq (for ≤ 6 replicates per condition) in our tests with synthetic data.
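For readers unfamiliar with the Storey-Tibshirani procedure mentioned above, the following is a minimal Python sketch of its standard form: estimate the proportion of true nulls, π0, from the p-values above a threshold, then convert p-values to q-values. It is not the Polyfit code and does not reproduce the adaptation described in the abstract. The function names `estimate_pi0` and `qvalues` and the single fixed threshold `lam=0.5` are assumptions made for illustration; the original procedure smooths the estimate over a range of thresholds.

```python
def estimate_pi0(pvalues, lam=0.5):
    """Estimate the proportion of true null hypotheses.

    Assumes p-values above `lam` come almost entirely from nulls, which are
    uniform on [0, 1], so their density there approximates pi0.
    A single fixed `lam` is a simplification chosen for this sketch.
    """
    m = len(pvalues)
    tail = sum(1 for p in pvalues if p > lam)
    return min(1.0, tail / (m * (1.0 - lam)))


def qvalues(pvalues, lam=0.5):
    """Convert p-values to q-values, returned in the input order."""
    m = len(pvalues)
    pi0 = estimate_pi0(pvalues, lam)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvalues[i] / rank)
        q[i] = running_min
    return q


if __name__ == "__main__":
    example = [0.01, 0.02, 0.03, 0.6, 0.8]
    print(estimate_pi0(example))
    print(qvalues(example))
```

If the null p-values are not uniform, the tail-based estimate of π0 in this sketch is biased, which is the situation the abstract says Polyfit is designed to address.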

Supplemental Information

Figures

Figures S1 to S23 cited in the text

DOI: 10.7287/peerj.preprints.400v1/supp-1


R code

R code for the Polyfit software together with, as an example, the code used for generating Fig. 5, and a data file of means and overdispersions used for generating the various synthetic data sets used in this paper.

DOI: 10.7287/peerj.preprints.400v1/supp-2
