This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ PrePrints) and either DOI or URL of the article must be cited.
In high-throughput sequencing (HTS) projects, the sequenced fragments’ duplicate rate is a key quality metric. A high duplicate rate may arise from a low amount of input DNA and many PCR cycles. Many methods for downstream analyses require that duplicates be removed. If the duplicate rate is high, most of the sequencing effort and money spent would have been in vain. Therefore, it is of considerable interest to estimate the duplicate rate after sequencing only a small subsample at low depth (multiplexed with other libraries) for quality control before running the full experiment. In this article, we provide an elementary mathematical framework and an efficient computational approach based on quadratic and linear optimization to estimate the true duplicate rate from a small subsample. Our method is based on up-sampling the occupancy distribution of the reads’ copy numbers. Compared to an existing approach, we use an explicit and easily explained mathematical model that accurately inverts the sub-sampling process. We evaluate the performance of our approach in comparison to that of the existing method on several artificial and real datasets. The same ideas can be used for diversity estimation in general. Software implementing our approach is available under the MIT license.
This work has been presented at the German Conference on Bioinformatics 2015. For v2, several typos and grammatical errors were corrected.