PRINSEQ++, a multi-threaded tool for fast and efficient quality control and preprocessing of sequencing datasets

Vito Adrian Cantu; Jeffrey Sadural; Robert Edwards

doi:10.7287/peerj.preprints.27553v1

Vito Adrian Cantu¹, Jeffrey Sadural², Robert Edwards²

¹ Computational Science Research Center, San Diego State University, San Diego, California, United States

² Department of Computer Science, San Diego State University, San Diego, California, United States

Published: 2019-02-27
Accepted: 2019-02-27

Subject Areas: Bioinformatics, Genomics, Computational Science
Keywords: Bioinformatics, Computational Biology, Software Engineering, Distributed and Parallel Computing

Copyright: © 2019 Cantu et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.

Cite this article: Cantu VA, Sadural J, Edwards R. 2019. PRINSEQ++, a multi-threaded tool for fast and efficient quality control and preprocessing of sequencing datasets. PeerJ Preprints 7:e27553v1 https://doi.org/10.7287/peerj.preprints.27553v1

Abstract

PRINSEQ++ is a C++ implementation of the very popular software prinseq-lite for quality control and preprocessing of sequencing datasets. PRINSEQ++ can run multi-threaded, which makes it more than 10 times faster than the original version. It can read from and write to compressed files, drastically reducing hard-drive usage. PRINSEQ++ can filter, trim and reformat sequences using a variety of options to improve downstream analysis. PRINSEQ++ is freely available on GitHub (https://github.com/Adrian-Cantu/PRINSEQ-plus-plus) and runs on all Unix-like systems.

Author Comment

This is a preprint submission to PeerJ Preprints.

Supplemental Information

Raw data for timing experiment

Each row is a timing measurement for one combination of input size and number of threads.

DOI: 10.7287/peerj.preprints.27553v1/supp-1

Summary statistics used to plot figure 1

Each row gives the mean time, standard deviation, standard error, and 95% confidence interval of the timing measurements for one combination of input size and number of threads. These data are derived from sup_table1.
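
The summary statistics described above can be sketched as follows. This is a minimal illustration, not the authors' notebook: the function name `summarize`, the row layout `(input_size, threads, seconds)`, and the sample values are all hypothetical, and the confidence interval uses a normal approximation, which is an assumption (the original analysis may have used a t-distribution instead).

```python
import math
import statistics
from collections import defaultdict


def summarize(rows, confidence=0.95):
    """Group timings by (input_size, threads) and summarize each group.

    `rows` is an iterable of (input_size, threads, seconds) tuples.
    This layout is hypothetical, not the format of the supplemental file.
    """
    groups = defaultdict(list)
    for size, threads, seconds in rows:
        groups[(size, threads)].append(seconds)

    # Two-sided critical value under a normal approximation (assumption).
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)

    summary = {}
    for key, times in groups.items():
        n = len(times)
        mean = statistics.mean(times)
        # Sample standard deviation; undefined for one value, so use 0.0.
        sd = statistics.stdev(times) if n > 1 else 0.0
        se = sd / math.sqrt(n)  # standard error of the mean
        summary[key] = {
            "n": n,
            "mean": mean,
            "sd": sd,
            "se": se,
            "ci": z * se,  # half-width of the confidence interval
        }
    return summary


# Made-up timings, for illustration only.
rows = [
    ("1M", 1, 10.0), ("1M", 1, 12.0), ("1M", 1, 14.0),
    ("1M", 4, 3.0), ("1M", 4, 5.0),
]
summary = summarize(rows)
```

Each entry of `summary` then corresponds to one row of a table like this one, with the interval reported as mean ± `ci`.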

DOI: 10.7287/peerj.preprints.27553v1/supp-2

Code to plot figure 1

Jupyter notebook with the code used to plot figure 1. It is also available in the GitHub repository.

DOI: 10.7287/peerj.preprints.27553v1/supp-3

Runtime comparison. Run time of prinseq-lite and PRINSEQ++ was measured on several paired FASTQ files of different sizes with equivalent options. PRINSEQ++ was run with different numbers of threads; prinseq-lite was run single-threaded. Mean speedup of PRINSEQ++ ove

DOI: 10.7287/peerj.preprints.27553v1/supp-4
