Global genomic similarity and core genome sequence diversity of the Streptococcus genus as a toolkit to identify close related bacterial strains in complex environments
- Published
- Accepted
- Subject Areas
- Biodiversity, Bioinformatics, Genomics, Microbiology, Molecular Biology
- Keywords
- genomic similarity score, core genome, Streptococcus, comparative genomics
- Copyright
- © 2018 Barajas de la Torre et al.
- Licence
- This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
- Cite this article
- 2018. Global genomic similarity and core genome sequence diversity of the Streptococcus genus as a toolkit to identify close related bacterial strains in complex environments. PeerJ Preprints 6:e26665v1 https://doi.org/10.7287/peerj.preprints.26665v1
Abstract
Background. Comparative genomics of closely related bacterial strains helps to distinguish important features such as pathogenesis, antibiotic resistance, and phylogenetic structure. Streptococcus is relevant to public health and food safety, and it is well represented (>100 genomes) in publicly available databases. Streptococci are cosmopolitan and have been isolated from multiple sources, from humans to dairy products. They have been classified by morphology, serotype, 16S rRNA gene sequence, and multilocus sequence typing (MLST). We propose the Genomic Similarity Score (GSS) as a tool to quantify genome-level relatedness among Streptococcus strains, and the use of their core genome as a simplified tool to assess strain-specific abundances in metagenomic sequences.
Methods. A 16S rRNA gene phylogeny was calculated for 108 strains belonging to 16 Streptococcus species, and the results were compared to a dendrogram built with the GSS using all homologous shared information available in the genomes. Additionally, the genus core genome and pan-genome were calculated. Core genome sequence identity was analyzed, and the core genome was used as a seed to discriminate abundances between closely related strains in metagenomic samples.
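The core and pan-genome calculation mentioned in the Methods can be illustrated with a minimal sketch. It assumes that each genome has already been reduced to a set of ortholog family identifiers by some upstream clustering step; the function name, the strain labels, and the family names are hypothetical and are not taken from the authors' notebook.

```python
from typing import Dict, Set, Tuple


def core_and_pan_genome(
    families_by_genome: Dict[str, Set[str]]
) -> Tuple[Set[str], Set[str]]:
    """Return (core, pan) sets of ortholog family IDs.

    core: families present in every genome (set intersection).
    pan:  families present in at least one genome (set union).
    """
    if not families_by_genome:
        return set(), set()
    family_sets = list(families_by_genome.values())
    core = set.intersection(*family_sets)
    pan = set.union(*family_sets)
    return core, pan


# Toy, made-up input: three hypothetical strains and five hypothetical families.
toy = {
    "strain_A": {"fam1", "fam2", "fam3"},
    "strain_B": {"fam1", "fam2", "fam4"},
    "strain_C": {"fam1", "fam2", "fam5"},
}
core, pan = core_and_pan_genome(toy)
print(sorted(core))  # families shared by all three toy strains
print(len(pan))      # number of distinct families across the toy strains
```

Under this definition the core genome can only shrink, and the pan-genome can only grow, as more genomes are added.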
Results. A total of 404 proteins are shared by all 108 Streptococcus genomes and constitute the core genome. The ranges of core identity values across all compared strains and outgroups are reported. The lowest sequence identity variation within the core (90-100%) belongs to ribosomal and translation-related proteins. Forty-eight proteins (11.8%) of the core genome are annotated as hypothetical proteins, and these host the largest sequence identity variations within the core. Core genome sequence identity diminishes as the GSS between species increases. The GSS dendrogram recovers most of the clades in the 16S rRNA gene phylogeny, with the advantage of resolving 16S polytomies (unresolved nodes). Finally, our proposed core genome was used to distinguish the abundances of closely related strains within human oral metagenomes, yielding strain relative abundances for healthy and caries-affected (with S. mutans) individuals.
Discussion. The importance of the Streptococcus genus to clinical practice and food safety, together with its excellent genomic coverage, makes it a good testing ground for multiple comparative genomic scenarios. Understanding genomic variability and strain relatedness is the goal of tools like the GSS, which uses both pairwise shared core and pan-genomic homologous sequences for its calculation. Combining the core genome with rapid alignment tools makes it possible to estimate abundances and discriminate strains in metagenomic samples. Here we share with the community both the GSS genomic dendrogram and the core genome to explore possibilities within streptococci.
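As a rough illustration of how strain relative abundances could be derived from core genome read recruitment, the sketch below counts reads per strain and normalizes the counts to fractions. It assumes that an aligner has already assigned each metagenomic read to the single strain whose core genome gave its best hit; the function name and the toy read assignments are hypothetical, and the authors' actual protocol may differ (for example in how ambiguous hits are handled).

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def strain_relative_abundance(
    best_hits: Iterable[Tuple[str, str]]
) -> Dict[str, float]:
    """Convert (read_id, strain) best-hit pairs into per-strain fractions."""
    counts = Counter(strain for _read_id, strain in best_hits)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {strain: n / total for strain, n in counts.items()}


# Toy, made-up assignments: four reads recruited to two hypothetical targets.
toy_hits = [
    ("read1", "S. mutans"),
    ("read2", "S. mutans"),
    ("read3", "S. mitis"),
    ("read4", "S. mutans"),
]
abundance = strain_relative_abundance(toy_hits)
for strain, fraction in sorted(abundance.items()):
    print(f"{strain}\t{fraction:.2f}")
```

The fractions sum to one within a sample, so they describe composition among the recruited reads rather than absolute abundance.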
Author Comment
This is a submission to PeerJ for review.
Supplemental Information
Genome sequences and their environmental sources used in this work
Information included: strain names, number of CDS, NCBI accession numbers, and isolation sources.
GSS calculations Jupyter notebook
Detailed bioinformatic protocols.
Core genome and pan-genome plots for the 108 Streptococcus strains
Additionally, the core genome of each Streptococcus species and the orthologous genes shared between strains.
Global pairwise identity for the core proteome of the streptococci
FASTA files of the core genome of each Streptococcus species
Abundances of Streptococcus in metagenomic samples
Calculated by core genome fragment recruitment, lowest common ancestor (LCA), and 16S rRNA gene abundances.