gapFinisher: a reliable gap filling pipeline for SSPACE-LongRead scaffolder output

Institute of Biotechnology, University of Helsinki, Helsinki, Finland
Blueprint Genetics Ltd., Helsinki, Finland
DOI
10.7287/peerj.preprints.3467v1
Subject Areas
Bioinformatics, Genetics, Genomics, Computational Science
Keywords
scaffolding, draft genomes, genome assembly, next-generation sequencing, long read technologies
Copyright
© 2017 Kammonen et al.
Licence
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
Cite this article
Kammonen JI, Smolander O, Paulin L, Pereira PA, Laine P, Koskinen P, Jernvall J, Auvinen P. 2017. gapFinisher: a reliable gap filling pipeline for SSPACE-LongRead scaffolder output. PeerJ Preprints 5:e3467v1

Abstract

Unknown sequences, or gaps, are present in most published genomes across public databases. Gap filling is an important finishing step in de novo genome assembly, especially for large genomes. The gap filling problem is nontrivial, and while many computational tools partially solve it, several have shortcomings in the reliability and correctness of their output, i.e. the gap-filled draft genome. SSPACE-LongRead is a scaffolding tool that uses long reads from multiple third-generation sequencing platforms to find links between contigs and combine them. The long reads potentially contain the sequence information needed to fill the gaps, but SSPACE-LongRead currently lacks this functionality. We present an automated pipeline called gapFinisher that processes SSPACE-LongRead output to fill gaps after the actual scaffolding. gapFinisher is based on controlled use of a gap filling tool called FGAP and works on all standard Linux/UNIX command lines. We conclude that performing the workflows of SSPACE-LongRead and gapFinisher enables users to fill gaps reliably. There is no need for further scrutiny of the existing sequencing data after performing the analysis.
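
As a minimal illustration of what a gap looks like in practice, the sketch below locates runs of unknown bases in scaffold sequences. It assumes the common FASTA convention that gaps are written as runs of the character N; the function names `read_fasta` and `find_gaps` and the `min_len` parameter are hypothetical and are not part of gapFinisher, SSPACE-LongRead or FGAP.

```python
import re
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # Use the first whitespace-delimited token as the record name.
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def find_gaps(sequence: str, min_len: int = 1) -> List[Tuple[int, int]]:
    """Return 0-based, end-exclusive (start, end) spans of N runs.

    Assumption: a gap is any run of 'N' or 'n' at least min_len long.
    """
    return [
        (m.start(), m.end())
        for m in re.finditer(r"[Nn]+", sequence)
        if m.end() - m.start() >= min_len
    ]


if __name__ == "__main__":
    demo = [">scaffold1 demo", "ACGTNNNN", "NNACGTnnA", ">scaffold2", "ACGT"]
    for name, seq in read_fasta(demo):
        print(name, find_gaps(seq))
```

Counting such spans before and after a gap filling run is one simple way to check how many gaps were closed, although it says nothing about whether the filled sequence is correct.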

Author Comment

This is a submission to PeerJ for review.

Supplemental Information

minidot (Li, 2016) plots of the six bacterial genomes at different stages of the assembly

DOI: 10.7287/peerj.preprints.3467v1/supp-1

Mauve (Darling et al., 2004) alignments of the six bacterial genomes at different stages of the assembly

DOI: 10.7287/peerj.preprints.3467v1/supp-2

All de novo assembly, scaffolding and gap filling statistics for the model genomes

DOI: 10.7287/peerj.preprints.3467v1/supp-3

Gap filling data used and FGAP (Piro et al., 2014) default test results reported for an unpublished draft genome of a marine mammal from the Phocidae family

DOI: 10.7287/peerj.preprints.3467v1/supp-4