Global genomic similarity and core genome sequence diversity of the Streptococcus genus as a toolkit to identify closely related bacterial species in complex environments.
- Subject Areas
- Biodiversity, Bioinformatics, Genomics, Microbiology, Molecular Biology
- Keywords
- genomic similarity score, core genome, Streptococcus, comparative genomics
- Copyright
- © 2018 Barajas de la Torre et al.
- Licence
- This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
- Cite this article
- 2018. Global genomic similarity and core genome sequence diversity of the Streptococcus genus as a toolkit to identify closely related bacterial species in complex environments. PeerJ Preprints 6:e26665v2 https://doi.org/10.7287/peerj.preprints.26665v2
Abstract
Background. Comparative genomics between closely related bacterial strains can distinguish important features determining pathogenesis, antibiotic resistance, and phylogenetic structure. The Streptococcus genus is relevant to public health and food safety and is well represented (>100 genomes) in publicly available databases. Streptococci are cosmopolitan, with multiple sources of isolation, from humans to dairy products. The Streptococcus genus has been classified by morphology, serotype, 16S rRNA gene sequence, and multilocus sequence typing (MLST). The Genomic Similarity Score (GSS) is proposed as a tool to quantify genome-level relatedness between species of Streptococcus. The Streptococcus core genome can be used to assess strain-specific abundances in metagenomic sequences.
Methods. A 16S rRNA gene phylogeny was calculated for 108 strains belonging to 16 Streptococcus species and compared with a dendrogram built from pairwise GSS distances for the same genomes. The core genome and pan-genome were calculated for these 108 genomes. The core genome sequences were analyzed and used as a resource to discriminate homologous fragment reads from closely related strains in metagenomic samples.
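The core genome and pan-genome calculation can be illustrated with a minimal sketch. It assumes each genome has already been reduced to a set of protein cluster identifiers; the clustering step, the function name `core_and_pan`, and the toy strain and cluster names are hypothetical and are not taken from the protocols of this work.

```python
def core_and_pan(genome_clusters):
    """Return (core, pan) cluster sets.

    genome_clusters: dict mapping a genome name to the set of protein
    cluster IDs found in that genome (assumed to be precomputed).
    """
    cluster_sets = list(genome_clusters.values())
    if not cluster_sets:
        return set(), set()
    # Core genome: clusters present in every genome.
    core = set.intersection(*cluster_sets)
    # Pan-genome: clusters present in at least one genome.
    pan = set.union(*cluster_sets)
    return core, pan


# Toy input (hypothetical strains and cluster IDs).
toy_genomes = {
    "strain_A": {"c1", "c2", "c3"},
    "strain_B": {"c1", "c2", "c4"},
    "strain_C": {"c1", "c2", "c5"},
}
core, pan = core_and_pan(toy_genomes)
print(sorted(core), sorted(pan))
```

In this toy case the core holds the two clusters found in all three strains, while the pan-genome grows with every strain-specific cluster.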
Results. A total of 404 proteins are shared by all 108 Streptococcus genomes and constitute the core genome. The pairwise amino acid identity values of the core proteins are reported for all the compared strains and outgroups. Lower variation in sequence identity (90-100%) is found predominantly in core clusters containing ribosomal and translation-related proteins. For 48 core proteins (11.8%) no functional assignment could be made, and those proteins show larger sequence identity variation than other core proteins. The sequence identity of the core genome diminishes as the GSS between species decreases. The GSS dendrogram recovers most of the clades in the 16S rRNA gene phylogeny while resolving 16S polytomies (unresolved nodes). Finally, the core genome was used to distinguish between closely related species within human oral metagenomes.
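As an illustration of a pairwise amino acid identity value, the sketch below computes percent identity for two sequences that are assumed to be already globally aligned. Conventions for the denominator differ between tools; here gapped columns are excluded, which is an assumption and not necessarily the convention used for the values reported in this work. The function name and the toy sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns containing a gap (assumed convention)
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy aligned fragments (hypothetical, not real core proteins).
identity = percent_identity("MKV-LT", "MKVALS")
print(identity)
```

Collecting this value for every pair of strains within one core cluster gives the range of identities for that cluster, which is the kind of spread the paragraph above describes.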
Discussion. The Streptococcus genus provides a benchmark dataset for comparative genomic studies owing to the breadth and depth of its genomic coverage. Comparing metagenomic shotgun fragment reads to the core genome using rapid alignment tools allows species-specific abundance estimates in metagenomic samples. Understanding genomic variability and strain relatedness is the goal of tools like GSS, which uses both the core and the pan-genomic homologous sequences shared by each pair of genomes in its calculation.
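The idea of estimating species-specific abundance by recruiting reads to the core genome can be sketched as follows. This is only an illustration under stated assumptions: alignment hits are assumed to be available as simple tuples, each read is assigned to the species of its highest-scoring hit, and the 95% identity cutoff is an arbitrary placeholder rather than a value from this work. The function name and the toy records are hypothetical.

```python
from collections import Counter


def species_abundance(hits, min_identity=95.0):
    """Relative abundance per species from read-to-core-genome hits.

    hits: iterable of (read_id, species, percent_identity, bit_score).
    min_identity: assumed cutoff; hits below it are ignored.
    """
    best = {}  # read_id -> (species, bit_score) of the best hit seen so far
    for read_id, species, identity, score in hits:
        if identity < min_identity:
            continue
        current = best.get(read_id)
        # Keep the highest-scoring hit; on ties the first one seen is kept.
        if current is None or score > current[1]:
            best[read_id] = (species, score)
    counts = Counter(species for species, _ in best.values())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {species: n / total for species, n in counts.items()}


# Toy hits (hypothetical reads and species labels).
toy_hits = [
    ("r1", "species_A", 99.0, 180.0),
    ("r1", "species_B", 96.0, 150.0),
    ("r2", "species_B", 98.0, 170.0),
    ("r3", "species_A", 80.0, 90.0),   # below the identity cutoff
    ("r4", "species_A", 97.0, 160.0),
]
abundance = species_abundance(toy_hits)
print(abundance)
```

Counting only the best hit per read keeps a read that matches homologous core genes in several close relatives from being counted more than once; a stricter variant could discard tied reads instead of keeping the first hit.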
Author Comment
New version after first round of peer-review.
Supplemental Information
Genome sequences and their environmental sources used in this work
Information included: strain names, number of CDS, NCBI accession numbers, isolation sources.
GSS calculations Jupyter notebook
Detailed bioinformatic protocols.
ANI correlogram for the selected 108 streptococci strains
Core genome and pan-genome plots for the 108 streptococci strains
Global pairwise identity for the core proteome of the streptococci
FASTA files for each streptococci species core genome
Abundances of Streptococcus in metagenomic samples
Calculated by core genome fragment recruitment, lowest common ancestor (LCA), and 16S rRNA gene abundances.