Prioritizing bona fide bacterial small RNAs with machine learning classifiers

Erik JJ Eppenhof; Lourdes Peña-Castillo

doi:10.7287/peerj.preprints.26974v1

NOT PEER-REVIEWED

"PeerJ Preprints" is a venue for early communication or feedback before peer review. Data may be preliminary.

A peer-reviewed article of this Preprint also exists.

Erik JJ Eppenhof¹, Lourdes Peña-Castillo²,³

1 Department of Artificial Intelligence, Radboud University Nijmegen, Nijmegen, Netherlands

2 Department of Biology, Memorial University of Newfoundland, St. John's, Canada

3 Department of Computer Science, Memorial University of Newfoundland, St. John’s, Canada

DOI: 10.7287/peerj.preprints.26974v1

Published: 2018-06-02
Accepted: 2018-06-02

Subject Areas: Bioinformatics, Microbiology, Data Mining and Machine Learning
Keywords: sRNA prioritization, Machine learning, sRNA characterization, Random forest, Multilayer perceptron, Boosting algorithms, Bacterial non-coding RNAs, sRNACharP, sRNARanking

Copyright: © 2018 Eppenhof et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.

Cite this article: Eppenhof EJ, Peña-Castillo L. 2018. Prioritizing bona fide bacterial small RNAs with machine learning classifiers. PeerJ Preprints 6:e26974v1 https://doi.org/10.7287/peerj.preprints.26974v1

Abstract

Bacterial small non-coding RNAs (sRNAs) are involved in the control of several cellular processes. Hundreds of putative sRNAs have been identified in many bacterial species through RNA sequencing. The existence of putative sRNAs is usually validated by Northern blot analysis. However, the large number of novel putative sRNAs reported in the literature makes it impractical to validate each of them in the wet lab. In this work, we applied five machine learning approaches to construct twenty models to discriminate bona fide sRNAs from random genomic sequences in five bacterial species. Sequences were represented using seven features, including the free energy of their predicted secondary structure, their distances to the closest predicted promoter site and Rho-independent terminator, and their distances to the closest open reading frames (ORFs). To automatically calculate these features, we developed an sRNA Characterization Pipeline (sRNACharP). All seven features used in the classification task contributed positively to the performance of the predictive models. The five best-performing models obtained a median precision of 100% at 10% recall and of 60% at 40% recall across all five bacterial species. Our results suggest that even though there is limited sRNA sequence conservation across different bacterial species, there are intrinsic features of sRNAs that are conserved across taxa. We show that these features are exploited by machine learning approaches to learn a species-independent model to prioritize bona fide bacterial sRNAs.
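
The metric reported above, precision at a fixed recall level, can be illustrated with a minimal sketch. The function name, the toy labels, and the toy scores are hypothetical and do not come from the preprint, sRNACharP, or sRNARanking. The sketch assumes that precision at recall r is measured on the shortest top-ranked prefix of the predictions that reaches recall r, which may differ from how the authors computed it.

```python
def precision_at_recall(labels, scores, target_recall):
    """Precision of the shortest top-ranked prefix that reaches target_recall.

    labels: 1 for a bona fide sRNA, 0 for a random genomic sequence.
    scores: classifier scores; a higher score means "more sRNA-like".
    target_recall: fraction of all positives to recover, in (0, 1].
    """
    if not 0 < target_recall <= 1:
        raise ValueError("target_recall must be in (0, 1]")
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one positive instance is required")

    # Rank instances from the highest to the lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    true_positives = 0
    for rank, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        # Stop at the first rank where the requested recall is reached.
        if true_positives / total_positives >= target_recall:
            return true_positives / rank


# Toy data (hypothetical): five positives among ten ranked sequences.
labels = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
scores = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20]

for recall in (0.1, 0.4, 1.0):
    print(recall, precision_at_recall(labels, scores, recall))
```

In a prioritization setting, a high precision at low recall means that the top-ranked candidates are mostly true sRNAs, which is the property that matters when only a few candidates can be validated experimentally.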

Author Comment

This is a submission to PeerJ for review.

Supplemental Information

Positive and negative sRNA instances of five bacterial species

Feature vectors as obtained by sRNACharP for positive (bona fide sRNA) and negative (random genomic sequences) instances of five bacterial species. These datasets were used to train and validate the machine learning models discussed in the manuscript.

DOI: 10.7287/peerj.preprints.26974v1/supp-1
