Prioritizing bona fide bacterial small RNAs with machine learning classifiers
- Subject Areas
- Bioinformatics, Microbiology, Data Mining and Machine Learning
- Keywords
- sRNA prioritization, Machine learning, sRNA characterization, Random forest, Multilayer perceptron, Boosting algorithms, Bacterial non-coding RNAs, sRNACharP, sRNARanking
- Copyright
- © 2018 Eppenhof et al.
- Licence
- This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
- Cite this article
- 2018. Prioritizing bona fide bacterial small RNAs with machine learning classifiers. PeerJ Preprints 6:e26974v1 https://doi.org/10.7287/peerj.preprints.26974v1
Abstract
Bacterial small non-coding RNAs (sRNAs) are involved in the control of several cellular processes. Hundreds of putative sRNAs have been identified in many bacterial species through RNA sequencing. The existence of putative sRNAs is usually validated by Northern blot analysis. However, the large number of novel putative sRNAs reported in the literature makes it impractical to validate each of them in the wet lab. In this work, we applied five machine learning approaches to construct twenty models to discriminate bona fide sRNAs from random genomic sequences in five bacterial species. Sequences were represented using seven features, including the free energy of their predicted secondary structure, their distances to the closest predicted promoter site and Rho-independent terminator, and their distances to the closest open reading frames (ORFs). To automatically calculate these features, we developed an sRNA Characterization Pipeline (sRNACharP). All seven features used in the classification task contributed positively to the performance of the predictive models. The five best performing models obtained a median precision of 100% at 10% recall and 60% at 40% recall across all five bacterial species. Our results suggest that even though there is limited sRNA sequence conservation across different bacterial species, there are intrinsic features of sRNAs that are conserved across taxa. We show that these features are exploited by machine learning approaches to learn a species-independent model to prioritize bona fide bacterial sRNAs.
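The performance figures above are stated as precision at a fixed recall level. As a minimal sketch of how such a value can be read off a ranked list of candidates, the following Python function ranks sequences by classifier score and reports the precision of the shortest top-ranked prefix that reaches the target recall. The function name `precision_at_recall`, the labels, and the scores are hypothetical and for illustration only; they are not taken from the manuscript or from sRNACharP, and tied scores are ordered arbitrarily here.

```python
def precision_at_recall(labels, scores, target_recall):
    """Precision of the shortest top-ranked prefix whose recall reaches target_recall.

    labels: 1 for a bona fide sRNA, 0 for a random genomic sequence.
    scores: classifier scores, higher meaning more likely to be an sRNA.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive instance")
    if not 0 < target_recall <= 1:
        raise ValueError("target_recall must be in (0, 1]")

    # Rank candidates from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    true_pos = 0
    for k, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        if true_pos / total_pos >= target_recall:
            return true_pos / k  # fraction of the top k that are positives
    return true_pos / len(ranked)


# Toy data (assumed, not from the paper): 4 positives among 10 candidates.
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]

for recall in (0.25, 0.75, 1.0):
    print(recall, precision_at_recall(labels, scores, recall))
```

Read this way, a precision of 100% at 10% recall would mean that the top-ranked candidates needed to recover the first tenth of the known sRNAs contain no negatives, which is the regime that matters when only a few candidates can be taken to the wet lab.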
Author Comment
This is a submission to PeerJ for review.
Supplemental Information
Positive and negative sRNA instances of five bacterial species
Feature vectors as obtained by sRNACharP for positive (bona fide sRNA) and negative (random genomic sequences) instances of five bacterial species. These datasets were used to train and validate the machine learning models discussed in the manuscript.