This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ PrePrints) and either DOI or URL of the article must be cited.
Cite this article
Sahl JW, Caporaso G, Rasko DA, Keim PS. (2014) The large-scale blast score ratio (LS-BSR) pipeline: a method to rapidly compare genetic content between bacterial genomes. PeerJ PrePrints 2:e220v1 https://doi.org/10.7287/peerj.preprints.220v1
Background. As whole genome sequence data from bacterial isolates becomes cheaper to generate, computational methods are needed to correlate sequence data with biological observations. Here we present the large-scale BLAST score ratio (LS-BSR) pipeline, which rapidly compares the genetic content of hundreds to thousands of bacterial genomes and returns a matrix that describes the relatedness of all coding sequences (CDSs) in all genomes surveyed. This matrix can be easily parsed to identify genetic relationships between bacterial genomes. Although pipelines have been published that group peptides by sequence similarity, no other software performs the large-scale, flexible, full-genome comparative analyses carried out by LS-BSR.
Results. To demonstrate the utility of the method, the LS-BSR pipeline was tested on 96 Escherichia coli and Shigella genomes; the pipeline ran in 163 minutes using 16 processors, a greater than 7-fold speedup compared to using a single processor. The BSR values for each CDS, which indicate a relative level of relatedness, were then mapped to each genome on an independent core genome single nucleotide polymorphism (SNP) based phylogeny. Comparisons were then used to identify clade-specific CDS markers and to validate the LS-BSR pipeline based on molecular markers that delineate classical E. coli pathogenic variant (pathovar) designations. Scalability tests demonstrated that the LS-BSR pipeline can process 1,000 E. coli genomes in ~60 h using 16 processors.
Conclusions. LS-BSR is an open-source, parallel implementation of the BSR algorithm, enabling rapid comparison of the genetic content of large numbers of genomes. The results of the pipeline can be used to identify specific markers between user-defined phylogenetic groups, and to identify the loss and/or acquisition of genetic information between bacterial isolates. Taxa-specific genetic markers can then be translated into clinical diagnostics, or can be used to identify broadly conserved putative therapeutic candidates.
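The abstract does not spell out how a BSR value is calculated. As a minimal sketch, assuming the usual definition of the BLAST score ratio (the bit score of a CDS aligned against a genome, divided by the bit score of that CDS aligned against itself), the matrix could be assembled as below. The function names and the dictionary layout are hypothetical and are not taken from the LS-BSR code; the alignment step itself is not shown.

```python
def blast_score_ratio(query_bits, reference_bits):
    """Return query bit score divided by the self-alignment bit score."""
    if reference_bits <= 0:
        raise ValueError("reference (self) bit score must be positive")
    return query_bits / reference_bits


def bsr_matrix(self_scores, query_scores):
    """Build a CDS-by-genome matrix of BSR values.

    self_scores:  {cds_id: bit score of the CDS aligned to itself}
    query_scores: {genome_id: {cds_id: best bit score in that genome}}
    A CDS with no hit in a genome is assumed to score 0 (hypothetical choice).
    """
    matrix = {}
    for cds_id, reference_bits in self_scores.items():
        row = {}
        for genome_id, hits in query_scores.items():
            row[genome_id] = blast_score_ratio(hits.get(cds_id, 0.0), reference_bits)
        matrix[cds_id] = row
    return matrix
```

Under this definition a value near 1 indicates a CDS that is highly conserved in a genome, and a value near 0 indicates that it is absent or highly divergent.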
LS-BSR not only compares the complete genetic content of bacterial genomes, but can also screen a set of genomes for a known set of genes. The results are designed to be easily visualized, giving clear depictions of gene differences between isolates. Additional scripts are included to parse the resulting LS-BSR matrix, which should help with rapidly comparing genetic content between groups (e.g. pathogenic vs. commensal isolates).
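As an illustration of the kind of group comparison described here, the sketch below reads a tab-delimited matrix and lists the CDSs that are conserved in every genome of one group and absent from all others. It is not one of the scripts distributed with LS-BSR. The file layout (genome names in the header row, one CDS per row) and the two thresholds (0.8 for "present", 0.4 for "absent") are assumptions made for this example.

```python
import csv
import io


def parse_matrix(text):
    """Parse a tab-delimited BSR matrix into {cds_id: {genome_id: bsr}}.

    Assumed layout: the first row holds genome names (first cell ignored),
    and each later row holds a CDS identifier followed by BSR values.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    genomes = rows[0][1:]
    matrix = {}
    for row in rows[1:]:
        if not row:
            continue  # skip blank lines
        matrix[row[0]] = {g: float(v) for g, v in zip(genomes, row[1:])}
    return matrix


def group_specific_cds(matrix, group, upper=0.8, lower=0.4):
    """Return CDS ids with BSR >= upper in every genome of `group`
    and BSR < lower in every other genome (thresholds are assumptions)."""
    group = set(group)
    markers = []
    for cds_id, row in matrix.items():
        inside = [v for g, v in row.items() if g in group]
        outside = [v for g, v in row.items() if g not in group]
        if inside and all(v >= upper for v in inside) and all(v < lower for v in outside):
            markers.append(cds_id)
    return sorted(markers)
```

For example, calling `group_specific_cds(matrix, {"pathogen1", "pathogen2"})` on a parsed matrix would return candidate markers for those two hypothetical pathogenic isolates relative to every other genome in the matrix.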
Core genome SNP phylogeny of 96 E. coli/Shigella genomes
The core genome was extracted from the output of Mugsy (Angiuoli & Salzberg 2010) and the phylogeny was inferred with FastTree2 (Price et al. 2010). This phylogeny contains labels that can be used to identify specific genomes in Figures 2 and 3.