PaPrBaG: A random forest approach for the detection of novel pathogens from NGS data

Research Group Bioinformatics (NG4), Robert Koch Institute, Berlin, Germany
DOI
10.7287/peerj.preprints.2379v1
Subject Areas
Bioinformatics, Genomics, Microbiology, Statistics
Keywords
Pathogenicity prediction, Analysis of NGS data, Applied machine learning, Clinical microbiology, Novel pathogen detection
Copyright
© 2016 Deneke et al.
Licence
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
Cite this article
Deneke C, Rentzsch R, Renard BY. 2016. PaPrBaG: A random forest approach for the detection of novel pathogens from NGS data. PeerJ Preprints 4:e2379v1

Abstract

The reliable detection of novel bacterial pathogens from next generation sequencing data is a key challenge for microbial diagnostics. Current computational tools usually rely on sequence similarity and often fail to detect novel species when closely related genomes are unavailable or missing from the reference database used. Here, we present the random forest-based approach PaPrBaG (Pathogenicity Prediction for Bacterial Genomes). PaPrBaG overcomes genetic divergence by training on a wide set of species with known pathogenicity phenotype. To that end, we generated a novel label source of pathogenic and non-pathogenic bacterial strains, using a rule-based protocol to annotate pathogenicity based on genome metadata. A detailed comparative study reveals that PaPrBaG has several advantages over sequence similarity approaches. Most importantly, it always provides a prediction, whereas other approaches discard a large number of sequencing reads that are far away from currently known reference genomes. Furthermore, PaPrBaG remains reliable even at very low genomic coverage. Combining PaPrBaG with existing approaches further improves prediction results.

Author Comment

This article has been accepted for the "GCB 2016 Conference".

Supplemental Information

100 most important features

The top 100 most important features, ranked by permutation importance. Results are shown for the first fold of the five-fold cross-validation; the other folds gave very similar results. The four amino acid index accessions that occur (see references 41-43 for details) are (i) The Kerr effect of amino acids in water: The Kerr-constant increments [KHAG800101], (ii) Characterization of multiple bends in proteins: Normalized relative frequency of double bend [ISOY800107], (iii) Shape and surface features of globular proteins: Correlation coefficient in regression analysis [PRAM820103] and (iv) Protein secondary structure: Normalized frequency of beta-sheet in alpha+beta class [PALJ810111]. Furthermore, every k-mer feature listed includes (and counts) its respective reverse complement.

DOI: 10.7287/peerj.preprints.2379v1/supp-1
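
The supplemental description notes that each k-mer feature also covers its reverse complement. The following is a minimal sketch of how such strand-independent k-mer features could be computed for a single read. It is not the PaPrBaG implementation: the function names, the choice of the lexicographically smaller k-mer as the representative, the normalization to relative frequencies, and the skipping of k-mers with ambiguous bases are all assumptions made for illustration.

```python
from collections import Counter

# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Map a k-mer and its reverse complement to one shared key.

    Assumption: the lexicographically smaller of the two is used as the
    representative; the source does not state which convention it follows.
    """
    return min(kmer, reverse_complement(kmer))


def canonical_kmer_frequencies(read: str, k: int) -> dict:
    """Relative frequencies of canonical k-mers in a read (hypothetical helper).

    k-mers containing characters other than A, C, G, T are skipped
    (an assumption about how ambiguous bases might be handled).
    """
    read = read.upper()
    counts = Counter()
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[canonical(kmer)] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}
```

For example, the read `AACGTT` with `k=3` contains the k-mers AAC, ACG, CGT and GTT; because GTT is the reverse complement of AAC and CGT that of ACG, they collapse into two features. A vector of such frequencies per read is the kind of input a random forest classifier could be trained on.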