SECAPR - A bioinformatics pipeline for the rapid and user-friendly processing of Illumina sequences, from raw reads to alignments
- Subject Areas
- Bioinformatics, Computational Biology, Evolutionary Studies, Genetics
- Keywords
- Next generation sequencing (NGS), exon capture, Illumina, FASTQ, contig, allele phasing, phylogenetics, phylogeography, BAM, assembly
- Copyright
- © 2018 Andermann et al.
- Licence
- This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
- Cite this article
- 2018. SECAPR - A bioinformatics pipeline for the rapid and user-friendly processing of Illumina sequences, from raw reads to alignments. PeerJ Preprints 6:e26477v3 https://doi.org/10.7287/peerj.preprints.26477v3
Abstract
Evolutionary biology has entered an era of unprecedented amounts of DNA sequence data, as new sequencing technologies such as massively parallel sequencing (MPS) can generate billions of nucleotides in less than a day. The current bottleneck is how to handle, process, and analyze such large amounts of data efficiently, in an automated and reproducible way. To tackle these challenges, we introduce the Sequence Capture Processor (SECAPR) pipeline, which processes raw sequencing data into multiple sequence alignments for downstream phylogenetic and phylogeographic analyses. SECAPR is user-friendly, and we provide an exhaustive empirical data tutorial intended for users with no prior experience in analyzing MPS output. SECAPR is particularly useful for processing sequence capture (synonyms: target or hybrid enrichment) datasets for non-model organisms, as we demonstrate using an empirical sequence capture dataset of the palm genus Geonoma (Arecaceae). Various quality control and plotting functions help the user decide on the most suitable settings, even for challenging datasets. SECAPR is an easy-to-use, free, and versatile pipeline designed to enable efficient and reproducible processing of MPS data for many samples in parallel.
Author Comment
This new submission has a slightly changed title and active links for downloading the supplemental data. We also restructured the manuscript according to the PeerJ publication guidelines. SECAPR download and installation instructions are available here: https://github.com/AntonelliLab/seqcap_processor/blob/master/documentation.ipynb
Supplemental Information
Appendix 1: Library preparation and sequencing of Geonoma sample data
Detailed description of the lab workflow.
Table S1: Biological data info
The table lists the IDs of the Geonoma samples, together with the locality and collector information for each sample.
Table S2: Overview of recovered contigs
The table shows which exon loci had a matching contig in each of the Geonoma samples.
Table S3: Overview of all contigs flagged by the find_target_contigs function
The document lists all contigs that were flagged by the find_target_contigs function. For each sample, it lists all exons that were excluded because of possible paralogy (more than one matching contig). It also lists, for each sample, all contig IDs that matched multiple exons. We did not exclude these contigs in the example data, as most of them were long contigs spanning several adjacent exons.
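The two kinds of flags described for Table S3 can be illustrated with a minimal Python sketch. This is not the SECAPR implementation of find_target_contigs; the function name `flag_matches`, the input format (a list of exon/contig pairs for one sample), and the example IDs are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def flag_matches(matches):
    """Flag exon/contig matches for one sample.

    `matches` is an iterable of (exon_id, contig_id) pairs. This input
    format is an assumption for illustration, not SECAPR's own format.
    """
    contigs_per_exon = defaultdict(set)
    exons_per_contig = defaultdict(set)
    for exon_id, contig_id in matches:
        contigs_per_exon[exon_id].add(contig_id)
        exons_per_contig[contig_id].add(exon_id)

    # Exons with more than one matching contig: possible paralogy, excluded.
    paralogous_exons = sorted(
        exon for exon, contigs in contigs_per_exon.items() if len(contigs) > 1
    )
    # Contigs matching several exons: reported, but retained here, mirroring
    # the choice described for the example data.
    multi_exon_contigs = sorted(
        contig for contig, exons in exons_per_contig.items() if len(exons) > 1
    )
    # Exons kept for downstream steps: exactly one matching contig.
    kept = {
        exon: next(iter(contigs))
        for exon, contigs in contigs_per_exon.items()
        if len(contigs) == 1
    }
    return paralogous_exons, multi_exon_contigs, kept


# Hypothetical example data for a single sample.
example = [
    ("exon1", "c1"), ("exon1", "c2"),  # two contigs for one exon
    ("exon2", "c3"), ("exon3", "c3"),  # one long contig spanning two exons
    ("exon4", "c4"),
]
paralogous, multi_exon, kept = flag_matches(example)
```

In this toy example, exon1 is excluded as potentially paralogous, while contig c3 is only reported and its two exons are kept.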
Table S4: Overview of read-coverage for selected exon loci
The table contains an overview of the read-depth information for all 97 exon loci that had an average read-depth of at least 3 reads across all Geonoma samples.
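As a rough illustration of the selection criterion for Table S4, the sketch below keeps loci whose read depth, averaged over samples, is at least 3. It assumes that "average across all samples" means the arithmetic mean of per-sample depths for each locus; the function name `select_loci`, the input layout, and the example values are hypothetical and are not taken from SECAPR or from the Geonoma data.

```python
from statistics import mean


def select_loci(depth_by_locus, min_mean_depth=3.0):
    """Return the loci whose mean read depth across samples meets the threshold.

    `depth_by_locus` maps a locus ID to a list of per-sample read depths
    (an assumed layout, for illustration only).
    """
    return sorted(
        locus
        for locus, depths in depth_by_locus.items()
        if depths and mean(depths) >= min_mean_depth
    )


# Hypothetical per-sample read depths for three loci.
example_depths = {
    "exon_a": [5, 4, 2],
    "exon_b": [1, 2, 3],
    "exon_c": [3, 3, 3],
}
selected = select_loci(example_depths)
```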
Table S5: Exon index legend for Fig. 4
This table contains an overview of which exon locus corresponds to which index in Fig. 4 of the manuscript.