Efficient duplicate rate estimation from subsamples of sequencing libraries

Christopher Schröder; Sven Rahmann

doi:10.7287/peerj.preprints.1298v2

Christopher Schröder^1, Sven Rahmann^1,2

1 Genome Informatics and Human Genetics, Faculty of Medicine, University of Duisburg-Essen, Essen, Germany

2 Computer Science XI, TU Dortmund, Dortmund, Germany

Published: 2015-09-18
Accepted: 2015-09-18

Subject Areas: Bioinformatics, Computational Biology
Keywords: DNA sequencing, library complexity, diversity estimation, occupancy distribution, linear programming

Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ PrePrints) and either DOI or URL of the article must be cited.

Cite this article: Schröder C, Rahmann S. 2015. Efficient duplicate rate estimation from subsamples of sequencing libraries. PeerJ PrePrints 3:e1298v2 https://doi.org/10.7287/peerj.preprints.1298v2

Abstract

In high-throughput sequencing (HTS) projects, the duplicate rate of the sequenced fragments is a key quality metric. A high duplicate rate may arise from a low amount of input DNA and many PCR cycles. Many methods for downstream analysis require that duplicates be removed, so if the duplicate rate is high, most of the sequencing effort and money is wasted. It is therefore of considerable interest to estimate the duplicate rate from a small subsample sequenced at low depth (multiplexed with other libraries) as a quality control step before running the full experiment. In this article, we provide an elementary mathematical framework and an efficient computational approach based on quadratic and linear optimization to estimate the true duplicate rate from a small subsample. Our method is based on up-sampling the occupancy distribution of the reads’ copy numbers. In contrast to an existing approach, we use an explicit and easily explained mathematical model that accurately inverts the sub-sampling process. We evaluate the performance of our approach against that of the existing method on several artificial and real datasets. The same ideas can be used for diversity estimation in general. Software implementing our approach is available under the MIT license.
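To make the quantities in the abstract concrete, the following minimal Python sketch illustrates the forward sub-sampling process that the method aims to invert. It is not the authors' software and does not perform the quadratic or linear optimization. The function names (`occupancy`, `duplicate_rate`, `subsample`), the toy library, and the definition of the duplicate rate as one minus the ratio of distinct fragments to total reads are assumptions made for illustration only.

```python
import random
from collections import Counter


def occupancy(copy_numbers):
    """Map k -> number of distinct fragments observed exactly k times.

    Fragments with copy number 0 (unobserved) are ignored.
    """
    return dict(Counter(c for c in copy_numbers if c > 0))


def duplicate_rate(occ):
    """Assumed definition: fraction of reads that are redundant copies,
    i.e. 1 - (distinct fragments) / (total reads)."""
    distinct = sum(occ.values())
    total = sum(k * n for k, n in occ.items())
    return 1.0 - distinct / total if total else 0.0


def subsample(copy_numbers, fraction, seed=0):
    """Draw a fraction of all reads uniformly without replacement and
    return the copy number of each fragment that appears in the subsample."""
    rng = random.Random(seed)
    # Expand the library into one entry per read, labelled by fragment index.
    reads = [frag for frag, c in enumerate(copy_numbers) for _ in range(c)]
    picked = rng.sample(reads, int(len(reads) * fraction))
    return list(Counter(picked).values())


# Hypothetical toy library: 1000 distinct fragments, 2100 reads in total.
library = [1] * 500 + [2] * 300 + [5] * 200

full_rate = duplicate_rate(occupancy(library))
sub_rate = duplicate_rate(occupancy(subsample(library, 0.05)))

print(f"full library duplicate rate: {full_rate:.3f}")
print(f"5% subsample duplicate rate: {sub_rate:.3f}")
```

In this toy setting the duplicate rate observed in the small subsample is far lower than that of the full library, because most fragments are drawn at most once. This is why the subsample's occupancy distribution has to be up-sampled rather than read off directly.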

Author Comment

This work has been presented at the German Conference on Bioinformatics 2015. For v2, several typos and grammatical errors were corrected.