Efficient duplicate rate estimation from subsamples of sequencing libraries

Genome Informatics and Human Genetics, Faculty of Medicine, University of Duisburg-Essen, Essen, Germany
Computer Science XI, TU Dortmund, Dortmund, Germany
DOI
10.7287/peerj.preprints.1298v2
Subject Areas
Bioinformatics, Computational Biology
Keywords
DNA sequencing, library complexity, diversity estimation, occupancy distribution, linear programming
Copyright
© 2015 Schröder et al.
Licence
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ PrePrints) and either DOI or URL of the article must be cited.
Cite this article
Schröder C, Rahmann S. 2015. Efficient duplicate rate estimation from subsamples of sequencing libraries. PeerJ PrePrints 3:e1298v2

Abstract

In high-throughput sequencing (HTS) projects, the duplicate rate of the sequenced fragments is a key quality metric. A high duplicate rate may arise from a low amount of input DNA combined with many PCR cycles. Many methods for downstream analysis require that duplicates be removed, so if the duplicate rate is high, most of the sequencing effort and money is wasted. It is therefore of considerable interest to estimate the duplicate rate for quality control before running the full experiment, after sequencing only a small subsample at low depth (multiplexed with other libraries). In this article, we provide an elementary mathematical framework and an efficient computational approach based on quadratic and linear optimization to estimate the true duplicate rate from a small subsample. Our method is based on up-sampling the occupancy distribution of the reads’ copy numbers. In contrast to an existing approach, we use an explicit and easily explained mathematical model that accurately inverts the sub-sampling process. We evaluate the performance of our approach in comparison to that of the existing method on several artificial and real datasets. The same ideas can be used for diversity estimation in general. Software implementing our approach is available under the MIT license.
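
To illustrate why the duplicate rate observed in a small subsample cannot be read off directly, the following minimal Python sketch computes the duplicate rate from an occupancy vector and the expected occupancy vector after sub-sampling. It is not the software accompanying the article. It assumes that the duplicate rate is defined as one minus the ratio of distinct fragments to total reads, and that sub-sampling keeps each read independently with probability p (binomial thinning); the function names and the example numbers are hypothetical.

```python
from math import comb

def duplicate_rate(occupancy):
    """Duplicate rate of a library given its occupancy vector.

    occupancy maps a copy number k >= 1 to the (possibly fractional,
    expected) number of distinct fragments seen exactly k times.
    Assumed definition: 1 - (distinct fragments) / (total reads).
    """
    total_reads = sum(k * n for k, n in occupancy.items())
    distinct = sum(occupancy.values())
    return 1.0 - distinct / total_reads

def expected_subsample(occupancy, p):
    """Expected occupancy vector after keeping each read with probability p.

    Assumption: independent binomial thinning of every read. A fragment
    with k copies keeps j copies with probability C(k, j) p^j (1-p)^(k-j).
    Fragments that lose all copies (j = 0) are unobserved and dropped.
    """
    sub = {}
    for k, n in occupancy.items():
        for j in range(1, k + 1):
            prob = comb(k, j) * p**j * (1 - p) ** (k - j)
            sub[j] = sub.get(j, 0.0) + n * prob
    return sub

# Hypothetical full library: 500 singletons, 300 fragments with 2 copies,
# 200 fragments with 4 copies.
full = {1: 500, 2: 300, 4: 200}
sub = expected_subsample(full, p=0.1)

print(f"full library duplicate rate: {duplicate_rate(full):.3f}")
print(f"10% subsample duplicate rate: {duplicate_rate(sub):.3f}")
```

Under these assumptions the subsample shows a much lower duplicate rate than the full library, because most copies of a duplicated fragment are not drawn. Estimating the true rate therefore requires inverting the mapping computed by `expected_subsample`, which is the up-sampling problem the abstract refers to.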

Author Comment

This work was presented at the German Conference on Bioinformatics 2015. For v2, several typos and grammatical errors were corrected.