gapFinisher: a reliable gap filling pipeline for SSPACE-LongRead scaffolder output

Juhana I Kammonen; Olli-Pekka Smolander; Lars Paulin; Pedro AB Pereira; Pia Laine; Patrik Koskinen; Jukka Jernvall; Petri Auvinen

doi:10.7287/peerj.preprints.3467v1

Juhana I Kammonen¹, Olli-Pekka Smolander¹, Lars Paulin¹, Pedro AB Pereira¹, Pia Laine¹, Patrik Koskinen², Jukka Jernvall¹, Petri Auvinen¹

¹ Institute of Biotechnology, University of Helsinki, Helsinki, Finland

² Blueprint Genetics Ltd., Helsinki, Finland

Published: 2017-12-15
Accepted: 2017-12-15

Subject Areas: Bioinformatics, Genetics, Genomics, Computational Science
Keywords: scaffolding, draft genomes, genome assembly, next-generation sequencing, long read technologies

Copyright: © 2017 Kammonen et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.

Cite this article: Kammonen JI, Smolander O, Paulin L, Pereira PA, Laine P, Koskinen P, Jernvall J, Auvinen P. 2017. gapFinisher: a reliable gap filling pipeline for SSPACE-LongRead scaffolder output. PeerJ Preprints 5:e3467v1 https://doi.org/10.7287/peerj.preprints.3467v1

Abstract

Unknown sequences, or gaps, are present in most published genomes in public databases. Gap filling is an important finishing step in de novo genome assembly, especially for large genomes. The gap filling problem is nontrivial, and while many computational tools partially solve it, several have shortcomings in the reliability and correctness of their output, i.e. the gap-filled draft genome. SSPACE-LongRead is a scaffolding tool that uses long reads from multiple third-generation sequencing platforms to find links between contigs and combine them. The long reads potentially contain the sequence information needed to fill the gaps, but SSPACE-LongRead currently lacks this functionality. We present an automated pipeline called gapFinisher that processes SSPACE-LongRead output to fill gaps after the actual scaffolding. gapFinisher is based on the controlled use of a gap filling tool called FGAP and works on all standard Linux/UNIX command lines. We conclude that running the SSPACE-LongRead and gapFinisher workflows enables users to fill gaps reliably, with no need for further scrutiny of the existing sequencing data after the analysis.

Author Comment

This is a submission to PeerJ for review.

Supplemental Information

minidot (Li, 2016) plots of the six bacterial genomes at different stages of the assembly

DOI: 10.7287/peerj.preprints.3467v1/supp-1

Mauve (Darling et al., 2004) alignments of the six bacterial genomes at different stages of the assembly

DOI: 10.7287/peerj.preprints.3467v1/supp-2

All de novo assembly, scaffolding and gap filling statistics for the model genomes

DOI: 10.7287/peerj.preprints.3467v1/supp-3

Gap filling data used and FGAP (Piro et al., 2014) default test results reported for an unpublished draft genome of a marine mammal from the Phocidae family

DOI: 10.7287/peerj.preprints.3467v1/supp-4