From trash to treasure: detecting unexpected contamination in unmapped NGS data

Ilaria Granata; Mara Sangiovanni; Amarinder Singh Thind; Mario Rosario Guarracino

doi:10.7287/peerj.preprints.3230v1

High Performance Computing and Networking Institute, National Research Council of Italy, Napoli, Italy

Published: 2017-09-06
Accepted: 2017-09-06

Subject Areas: Bioinformatics, Genomics, Computational Science
Keywords: Contaminating sequences, Unmapped reads, NGS data

Copyright: © 2017 Granata et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.

Cite this article: Granata I, Sangiovanni M, Thind AS, Guarracino MR. 2017. From trash to treasure: detecting unexpected contamination in unmapped NGS data. PeerJ Preprints 5:e3230v1 https://doi.org/10.7287/peerj.preprints.3230v1

Abstract

Standard procedures for NGS data analysis involve a pre-processing step in which read quality is assessed, followed by alignment of the filtered reads to a reference genome. Typically, between 70% and 90% of the reads map correctly to the chosen reference genome, which in some cases leaves a substantial fraction of sequences unmapped. Investigating the reasons for this discrepancy may provide relevant information about the source of these so-called unmapped reads. It is not unusual for genetic material from microorganisms to be present in biological samples undergoing sequencing. These exogenous sequences can derive from the microbiome of normal or altered tissues (upstream contamination) or from contamination occurring during sample processing (downstream contamination).
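As a rough illustration of how the unmapped fraction mentioned above can be measured, the sketch below counts unmapped records in SAM-formatted alignment text by testing the 0x4 bit of the FLAG field. The function name and the toy records are hypothetical, and this is not part of DecontaMiner itself.

```python
def unmapped_fraction(sam_lines):
    """Return (unmapped, total, fraction) over primary SAM records."""
    total = 0
    unmapped = 0
    for line in sam_lines:
        # Skip blank lines and header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # second SAM column is the bitwise FLAG
        # Ignore secondary (0x100) and supplementary (0x800) alignments
        # so that each sequenced read is counted once.
        if flag & 0x100 or flag & 0x800:
            continue
        total += 1
        if flag & 0x4:  # 0x4 means "segment unmapped"
            unmapped += 1
    fraction = unmapped / total if total else 0.0
    return unmapped, total, fraction


# Toy records (hypothetical), with only the mandatory SAM columns filled in.
example = [
    "@HD\tVN:1.6\tSO:unsorted",
    "r1\t0\tchr1\t100\t60\t50M\t*\t0\t0\t*\t*",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "r3\t16\tchr2\t200\t60\t50M\t*\t0\t0\t*\t*",
    "r4\t256\tchr3\t300\t0\t50M\t*\t0\t0\t*\t*",
    "r5\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]

if __name__ == "__main__":
    print(unmapped_fraction(example))
```

For paired-end data each mate is a separate record, so the result is a fraction of records rather than of read pairs.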

Here we propose DecontaMiner, a tool to unravel the presence of contaminating sequences among the unmapped reads. It uses a subtraction approach in which the sequences are first filtered according to quality parameters and then mapped to ribosomal, mitochondrial and foreign-organism databases. The reads that do not map to the human genome are then aligned, through a local alignment algorithm (MegaBlast), to bacterial, fungal and viral genomes. DecontaMiner generates several output files that track all the processed reads and provide a complete report of their characteristics. The good-quality matches to microorganism genomes are counted and compared among samples. The main novelty of DecontaMiner is its versatility, combined with a complete, easy-to-use and automatic pipeline.
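The final counting and comparison step could be sketched roughly as follows, assuming the MegaBlast results are available in the standard 12-column BLAST tabular format (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function names, the thresholds and the use of the subject identifier as the organism label are illustrative assumptions, not DecontaMiner's actual implementation or defaults.

```python
from collections import Counter


def count_good_hits(blast_lines, min_identity=95.0, min_length=50,
                    max_evalue=1e-5):
    """Count reads per subject, keeping each read's best passing hit.

    The thresholds are illustrative assumptions, not tool defaults.
    """
    best = {}  # read id -> (bitscore, subject id)
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        read_id, subject = f[0], f[1]
        identity, length = float(f[2]), int(f[3])
        evalue, bitscore = float(f[10]), float(f[11])
        # Discard matches that fail any of the quality criteria.
        if (identity < min_identity or length < min_length
                or evalue > max_evalue):
            continue
        # Keep only the highest-scoring hit for each read.
        if read_id not in best or bitscore > best[read_id][0]:
            best[read_id] = (bitscore, subject)
    return Counter(subject for _, subject in best.values())


def compare_samples(per_sample_counts):
    """Build an organism -> {sample: read count} table across samples."""
    organisms = sorted({org for counts in per_sample_counts.values()
                        for org in counts})
    return {org: {sample: counts.get(org, 0)
                  for sample, counts in per_sample_counts.items()}
            for org in organisms}


# Hypothetical MegaBlast tabular hits for two samples.
sample_a = [
    "r1\tE_coli\t99.0\t75\t1\t0\t1\t75\t100\t174\t1e-30\t135",
    "r1\tS_aureus\t96.0\t60\t2\t0\t1\t60\t10\t69\t1e-10\t90",
    "r2\tE_coli\t98.0\t70\t1\t0\t1\t70\t500\t569\t1e-28\t125",
    "r3\tHPV16\t80.0\t75\t15\t0\t1\t75\t1\t75\t1e-6\t60",
]
sample_b = [
    "r9\tHPV16\t100.0\t75\t0\t0\t1\t75\t1\t75\t1e-35\t139",
]

if __name__ == "__main__":
    table = compare_samples({
        "sample_a": count_good_hits(sample_a),
        "sample_b": count_good_hits(sample_b),
    })
    for organism, counts in table.items():
        print(organism, counts)
```

Keeping a single best hit per read avoids counting the same read for several organisms; a real pipeline would also need a policy for ties and for mapping subject identifiers to taxa.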

DecontaMiner has mainly been used to detect contamination in human RNA-seq data, but the pipeline can easily be tailored, through its configuration files and flags, to process DNA-seq data and unmapped reads from non-human species.

Author Comment

This abstract has been accepted for the NETTAB 2017 Workshop.