ScanFold 2.0: a rapid approach for identifying potential structured RNA targets in genomes and transcriptomes

Bioinformatics and Genomics


Introduction

Methods

TensorFlow training of z-score model

Updates to ScanFold 2.0 and integration in the webserver

Testing of ScanFold 2.0 vs ScanFold 1.0

ROC curve analysis

Results and discussion

Comparing time and accuracy of ScanFold 2.0 vs ScanFold 1.0

Mono vs Di nucleotide shuffling of ScanFold 2.0

ROC analysis of SARS-CoV-2

Selection of ScanFold parameters

Conclusion

Supplemental Information

Comparison of SF1 and SF2 CT files.

The results from all CT file comparisons for HIV-1, ZIKA, and SARS-CoV-2. These comparisons include SF1 to SF2, SF1 to SF1, and SF2 to SF2 using both shuffling methods and 100, 1,000, and 10,000 randomizations. The results are also summarized for each genome in a separate tab of the file, which includes all the raw comparisons, tables, and calculated differences in percent of paired nucleotides for all conditions, as well as SF1 and SF2 run time tables and bar charts.

DOI: 10.7717/peerj.14361/supp-1

SF2 Python notebook training code.

All training code for SF2 that was run in Google Colab. This Python notebook can be used to view and run all training code.

DOI: 10.7717/peerj.14361/supp-2

ROC analysis of all SF1 and SF2 results.

The results of the ROC analysis displayed in Figure 2. This includes all true positive rate (TPR), false positive rate (FPR), and AUC calculations for every dataset and all SF1 and SF2 conditions used. Both ROC curve plots can also be found at the end of the document.

DOI: 10.7717/peerj.14361/supp-3
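As a rough illustration of what a TPR, FPR, and AUC calculation involves, the following is a minimal sketch rather than the authors' analysis code. It assumes binary reference labels and treats lower scores as more positive (mirroring the convention that more negative z-scores are of greater interest); how labels and scores were actually defined for Figure 2 is not stated in this section. The function names `roc_points` and `auc` are hypothetical.

```python
def roc_points(scores, labels):
    """Sweep a threshold over the scores and return (fpr, tpr) lists.

    Assumption: lower scores are 'more positive', and labels are 0/1.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")

    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    ranked = sorted(zip(scores, labels))  # most negative score first
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every item tied at this threshold before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / neg)
        tpr.append(tp / pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for k in range(1, len(fpr)):
        area += (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2.0
    return area


if __name__ == "__main__":
    # Toy data only; not taken from the supplemental files.
    toy_scores = [-3.0, -2.0, -1.0, 0.5]
    toy_labels = [1, 0, 1, 0]
    print(auc(*roc_points(toy_scores, toy_labels)))
```

An AUC of 1.0 corresponds to scores that perfectly separate the two classes, and 0.5 to a ranking no better than chance.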

All SF1 and SF2 per window z-scores.

The scanning “.out” output data from all SF1 and SF2 run conditions for HIV-1, ZIKA, and SARS-CoV-2. All of the per window z-score data from each of the .out results were added to separate genome tabs and average values were calculated. These values can be found in the last four tabs of the Excel document.

DOI: 10.7717/peerj.14361/supp-4

Percent of nucleotides paired and percent similarity in pairing comparing SF1 to SF1 and SF2 to SF2.

The percent of nucleotides paired and the percent similarity in pairing from the comparison of SF1 to SF1 and SF2 to SF2 using both shuffling methods, with 100, 1,000, and 10,000 randomizations for SF1. All plots on the left side show percent paired and all plots on the right show percent similarity in pairing. From top to bottom (A-C), the plots show results for SARS-CoV-2, ZIKA, and HIV-1. All plots are organized as follows from left to right: SF1 mono- to di- with 100 randomizations, SF1 mono- to di- with 1,000 randomizations, SF1 mono- to di- with 10,000 randomizations, and SF2 mono- to di-, each for the -2 z-score cutoff; the same four comparisons for the -1 z-score cutoff; and the same four comparisons with no z-score filter.

DOI: 10.7717/peerj.14361/supp-5

Average SF1 and SF2 per window z-scores for all analyzed genomes.

Box and whisker plots of the average window z-score for all genomes analyzed. The plots show that, in general, mononucleotide shuffling produces lower z-scores, dinucleotide shuffling produces higher z-scores, and SF2 mono- and dinucleotide shuffling produce more similar z-scores than SF1 mono- and dinucleotide shuffling. From top to bottom (A-C), the plots represent ZIKA, HIV-1, and SARS-CoV-2. Each plot is organized in the same way, from left to right: SF1 100 randomizations with mono-, SF1 100 randomizations with di-, SF1 1,000 randomizations with mono-, SF1 1,000 randomizations with di-, SF1 10,000 randomizations with mono-, SF1 10,000 randomizations with di-, SF2 with mono-, and SF2 with di-.

DOI: 10.7717/peerj.14361/supp-6

Percent difference in percent of predicted paired nucleotides comparing SF1 to SF1 mono- and dinucleotide shuffling and SF2 to SF2 mono- and dinucleotide shuffling.

Each cell in the table is the percent difference in the percent of predicted nucleotides that are paired. Positive values indicate that mono- predicted more pairs and negative values indicate that di- predicted more pairs for the respective genome, version of SF, randomizations, and shuffling method. From left to right, the columns give the difference in percent paired for SF1 with 100 randomizations, SF1 with 1,000 randomizations, SF1 with 10,000 randomizations, and SF2, each for -2 z-score pairs; the same four conditions for -1 z-score pairs; and the same four conditions for pairs with no z-score filter. From top to bottom, the rows correspond to the ZIKA, HIV-1, and SARS-CoV-2 genomes.

DOI: 10.7717/peerj.14361/supp-7
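The sign convention in this table can be sketched with a small example. This is an illustrative reading, not the authors' code: it assumes each predicted structure is available as a list of pairing partners in the style of a CT file (partner index for a paired nucleotide, 0 for an unpaired one), and it assumes the "difference" is a simple subtraction of the two percentages, mono- minus di-. The function names are hypothetical.

```python
def percent_paired(partners):
    """Percent of nucleotides that have a pairing partner.

    `partners` holds one entry per nucleotide: the 1-based index of its
    partner, or 0 if the nucleotide is unpaired (CT-style convention).
    """
    if not partners:
        raise ValueError("empty structure")
    paired = sum(1 for p in partners if p != 0)
    return 100.0 * paired / len(partners)


def paired_difference(mono_partners, di_partners):
    """Mono- minus di- percent paired (assumed definition).

    Positive: the mono- run predicted more pairs.
    Negative: the di- run predicted more pairs.
    """
    return percent_paired(mono_partners) - percent_paired(di_partners)


if __name__ == "__main__":
    # Toy 8-nucleotide structures; not taken from the supplemental files.
    mono = [4, 3, 2, 1, 0, 0, 0, 0]  # two base pairs -> 50% paired
    di = [4, 0, 0, 1, 0, 0, 0, 0]    # one base pair  -> 25% paired
    print(paired_difference(mono, di))
```

If the table instead reports a relative percent change, the subtraction would need to be divided by one of the two percentages; the section does not say which definition was used.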

Average z-scores from all ScanFold analysis windows for SF1 and SF2 using both shuffling methods and different randomizations.

The average z-scores for all windows in every ScanFold run were determined, and the average difference between shuffling techniques was calculated from all analyzed genomes. From left to right: SF1 using 100 randomizations with mono-, SF1 using 100 randomizations with di-, difference between SF1 using 100 randomizations with mono- and dinucleotide, SF1 using 1,000 randomizations with mono-, SF1 using 1,000 randomizations with di-, difference between SF1 using 1,000 randomizations with mono- and dinucleotide, SF1 using 10,000 randomizations with mono-, SF1 using 10,000 randomizations with di-, difference between SF1 using 10,000 randomizations with mono- and dinucleotide, SF2 mono-, SF2 di-, and difference between SF2 with mono- and dinucleotide. From top to bottom: ZIKA, HIV-1, SARS-CoV-2, and the average difference in z-score between shuffling techniques.

DOI: 10.7717/peerj.14361/supp-8
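For orientation, a shuffling-based thermodynamic z-score compares the folding energy of a native window with the energies of its randomized versions, and the table above averages such values over all windows. The sketch below is a minimal illustration under stated assumptions, not the ScanFold implementation: it takes precomputed minimum free energies as plain numbers, uses the population standard deviation (an assumption), and defines the shuffling difference as mono- minus di- (also an assumption). The function names are hypothetical.

```python
from statistics import mean, pstdev


def z_score(native_mfe, shuffled_mfes):
    """Z-score of a native window's MFE against its shuffled versions.

    Negative values mean the native sequence folds more stably than
    its randomized counterparts.
    """
    spread = pstdev(shuffled_mfes)  # population std dev: an assumption
    if spread == 0:
        raise ValueError("shuffled energies have zero variance")
    return (native_mfe - mean(shuffled_mfes)) / spread


def average_difference(mono_window_z, di_window_z):
    """Average window z-score for mono- minus that for di- shuffling."""
    return mean(mono_window_z) - mean(di_window_z)


if __name__ == "__main__":
    # Toy energies in kcal/mol; not taken from the supplemental files.
    print(z_score(-10.0, [-6.0, -4.0]))
    print(average_difference([-1.0, -2.0], [-0.5, -1.5]))
```

With this sign convention, a negative average difference means mononucleotide shuffling gave lower z-scores than dinucleotide shuffling, which is the general trend described for the box and whisker plots.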

Additional Information and Declarations

Competing Interests

The authors declare that they have no competing interests.

Author Contributions

Ryan J. Andrews conceived and designed the experiments, performed the experiments, analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Warren B. Rouse conceived and designed the experiments, performed the experiments, analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Collin A. O’Leary analyzed the data, prepared figures and/or tables, and approved the final draft.

Nicholas J. Booher performed the experiments, prepared figures and/or tables, and approved the final draft.

Walter N. Moss conceived and designed the experiments, analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Data Availability

The following information was supplied regarding data availability:

The data and code are available in the Supplemental Files.

Funding

This research was supported by National Institute of General Medical Sciences R01GM133810 to Walter N. Moss and National Cancer Institute F31CA257090 to Warren B. Rouse. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
