VGEA: an RNA viral assembly toolkit

View article
Bioinformatics and Genomics

Introduction

The most abundant biological entities on Earth are viruses as they can be found among all cellular forms of life. So far, over four thousand five hundred viral species have been discovered, from which a huge amount of sequence information has been collected by researchers and scientists all over the world (Pickett et al., 2012; Sharma, Priyadarshini & Vrati, 2015; Brister et al., 2015). In recent times (past two decades), a number of these viruses have emerged in the human population causing disease outbreaks and sometimes pandemics. These viruses include mainly: Influenza virus, Severe Acute Respiratory Syndrome (SARS) coronavirus, Middle East Respiratory Syndrome (MERS) coronavirus, Ebola virus, Yellow fever virus, Lassa virus (LASV), Zika virus (Chan, 2002; Bean et al., 2013; Folarin et al., 2016; Grubaugh et al., 2017; Metsky et al., 2017; Siddle et al., 2018; Ajogbasile et al., 2020) and SARS-CoV-2 (Chen et al., 2020; Holshue et al., 2020; Sohrabi et al., 2020). During these outbreaks and pandemics, genomic sequencing for identification and characterization of the transmission and evolution of the causative agents have proved to be critical in helping inform disease surveillance and epidemiology.

Next Generation Sequencing (NGS) platforms have been widely accepted as high-throughput, open view technologies that have many attractive features for virus detection and assembly (Tang & Chiu, 2010; Mokili, Rohwer & Dutilh, 2012). NGS-based studies have vastly increased our understanding of viral diversity (Reyes et al., 2010; Cantalupo et al., 2011). Pathogen sequence data obtained from NGS experiments are a rich source of information, these data can be used to study their epidemiology, evolution, transmission patterns, and can also inform drug and vaccine design. The field of genomics, especially pathogen genomics has been transformed by NGS, with costs constantly decreasing, equipment becoming more portable/field deployable during outbreaks and remarkable increase in data availability.

The huge amount of data being generated requires various processing steps such as removal of primers and adapters, quality filtering and control which is usually crucial for various downstream analysis. Several tools have been developed for these purposes, such as fastp (Chen et al., 2018) and Trimmomatic (Bolger, Lohse & Usadel, 2014).

Reconstructing viral genomes from NGS data is usually achieved through de novo assembly (which is the process of assembling genomes using overlapping sequencing reads), or through a reference-guided approach (which involves mapping sequence reads to a reference genome). Numerous tools have been developed for these purposes; SPAdes (Bankevich et al., 2012), Burrows-Wheeler Alignment tool (BWA), V-GAP (Nakamura et al., 2016), VirusTAP (Yamashita, Sekizuka & Kuroda, 2016), V-Pipe (Posada-Céspedes et al., 2021) and viral-ngs (https://github.com/broadinstitute/viral-ngs), amongst others. Contigs generated by de novo assembly however do not provide a complete summary of reads, misassembly can result in the contigs having an incorrect structure, and for parts of the genome where contigs could not be assembled, no information is available. In addition, reference-guided assembly of viral genomes can lead to biased loss of information which can then skew epidemiological and evolutionary conclusions (Wymant et al., 2018).

Variant analysis and genome quality assessment to detect variants and changes occurring across the genome of a virus is also a key step in viral genome analysis as viruses (especially RNA viruses) are known to have high mutation rates (Duffy, 2018). Variant analysis is important for detecting outbreak origins and for phylogenetic/phylogeographic studies and best practices for variant identification in microbial genomes have been proposed in literature and adopted to a large extent (Van der Auwera et al., 2013).

A number of pipelines that have been developed for downstream analysis of viral genomes require high performance computing (HPC) clusters and/or cloud-based systems e.g., the V-pipe authors recommend running V-pipe on clusters because for most applications, running V-pipe on a local machine may not be efficient (https://github.com/cbg-ethz/V-pipe/wiki/advanced) and some of these pipelines are only web-based such as VirAmp (Wan et al., 2015) and VirusTAP (Yamashita, Sekizuka & Kuroda, 2016. Also, some pipelines have many dependencies to be installed especially if the analysis requires multiple tasks to be performed. In low-and-middle income countries (LMICs) where most scientists do not have access to HPC clusters or cloud-based systems and where internet connection is too unstable to regularly make use of web-based platforms for analysis, this can be a daunting task.

The challenges listed above motivated the development of VGEA (Viral Genomes Easily Analyzed, available online at https://github.com/pauloluniyi/VGEA). VGEA makes use of existing bioinformatics pipeline/tools to carry out various viral genome analysis tasks and is built on an advanced workflow management system, Snakemake (Köster & Rahmann, 2012).

Materials and Methods

Datasets

We successfully tested and validated VGEA with SARS-CoV-2 (n = 20) and Lassa Virus (n = 20) datasets sequenced on the illumina MiSeq and illumina FGx sequencing machines in our laboratory at the African Centre of Excellence for Genomics of Infectious Diseases (ACEGID), Redeemer’s University, Ede, Nigeria. Briefly, samples were inactivated in buffer AVL and viral RNA was extracted according to the QiAmp viral RNA mini kit (Qiagen) manufacturer’s instructions. Extracted RNA was treated with Turbo DNase to remove contaminating DNA, followed by cDNA synthesis with random hexamers. Sequencing libraries were prepared using the Nextera XT kit (Illumina) as previously described (Matranga et al., 2016) and sequenced on the Illumina Miseq platform with 101 base pair paired-end reads. We also tested and validated VGEA with HIV-1 datasets sequenced on the illumina HiSeq 2500 obtained from NCBI Sequence Read Archive (SRA). We made use of 60 test datasets (Lassa Virus (20), SARS-CoV-2 (20) and HIV-1 (20)) for the validation of the VGEA pipeline. All our test datasets are available on figshare (https://doi.org/10.6084/m9.figshare.13009997).

Implementation

The installation of VGEA requires the pipeline to be downloaded onto a personal computer and creation of a conda environment to set up all dependencies. Complete installation steps are in the github README file: https://github.com/pauloluniyi/VGEA/blob/master/README.md

The analysis of VGEA is broken down into a set of ‘rules’ that links the output file of an analysis into the input of the next task in the general workflow (Fig. 1). The dependencies are fastp for read trimming and read-level quality control, BWA for mapping sequencing reads to the human reference genome, SAMtools for extracting unmapped reads and also for splitting bam files into fastq files, IVA for de novo assembly to generate contigs, shiver to pre-process reads for quality and contamination, then map to a reference tailored to the sample using corrected contigs supplemented with the user’s choice of existing reference sequences, SeqKit for cleaning shiver assembly for QUAST, QUAST to evaluate/assess the quality of genome assemblies and MultiQC for aggregation of the results from fastp, BWA and QUAST

A schematic workflow of VGEA.

Figure 1: A schematic workflow of VGEA.

User-supplied paired-end fastq files are pre-processed and trimmed using FASTP followed by mapping to the human reference genome with BWA. Following mapping, a BAM file containing unaligned/unmapped reads is extracted using SAMTOOLS. This BAM file is then split into fastq files of forward and reverse reads also with SAMTOOLS after which de novo assembly is carried out using IVA. Following de novo assembly, SHIVER is used to map the reads and generate consensus sequences, and detailed minority variant information (full explanation of the shiver method is in File S1). SEQKIT is used to clean the SHIVER output for QUAST after which genome evaluation and assessment is carried out using QUAST. MULTIQC is then used for aggregation of results from BWA, FASTP and QUAST.

All of these tools can be installed using a bioconda channel (Grüning et al., 2018). The input files for VGEA are paired-end fastq files. VGEA allows full customization of the pipeline, so users can modify the parameters used in running their samples. It is possible to modify every step of the workflow to suit the samples being processed. Users can also add more steps to the pipeline as they see fit. The pipeline runs on Linux/Unix and Mac. However, no prior programming is required to run the pipeline and, once the user supplies the input, the whole workflow can run automatically from beginning to end.

Results

VGEA carries out read trimming and quality control tasks on input FASTQ data using fastp (Fig. 2). This increases the quality of data used for subsequent steps of the pipeline.

Fastp pre-processing report for a SARS-CoV-2 test dataset analyzed using VGEA.

Figure 2: Fastp pre-processing report for a SARS-CoV-2 test dataset analyzed using VGEA.

VGEA then maps reads to the human reference genome in order to remove human contaminants, the pipeline carries out this step using BWA. Genome assembly and consensus sequence generation is carried out, together with the generation of summary minority-variant information (base frequencies at each position) and detailed minority-variant information (all reads aligned to their correct position in the genome). VGEA carries out assembly using IVA and generates consensus sequences using shiver. Previous study by the shiver developers has shown the systematic superiority of mapping to shiver’s constructed reference compared with mapping the same reads to the closest of 3,249 references: median values of 13 bases called differently and more accurately, zero bases called differently and less accurately, and 205 bases of missing sequence recovered (Wymant et al., 2018).

VGEA also assesses the quality of genome assemblies using QUAST. QUAST evaluates metrics such as contig sizes, misassemblies and structural variations, genome representation and its functional elements, variations of N50 based on aligned blocks and then presents these statistics in graphical form. QUAST also makes a histogram of several metrics including the number of complete genes, operons and the genome fraction (%). Finally, VGEA compiles the results of BWA, fastp and QUAST into a single MultiQC report (Fig. 3).

MultiQC report of five SARS-CoV-2 datasets analyzed using VGEA.

Figure 3: MultiQC report of five SARS-CoV-2 datasets analyzed using VGEA.

Performance evaluation

VGEA makes use of Snakemake’s benchmarking feature which allows the measurement of the CPU usage and wall clock time of each rule in the pipeline. This allows the user to know which step of the pipeline requires the least and highest amount of computational resources. Knowledge of this can help the user decide on the number of threads to dedicate to each rule as VGEA also makes use of Snakemake’s multi-threading feature. Table 1 shows the benchmarking values for a sample SARS-CoV-2 dataset analyzed using VGEA.

Table 1:
Benchmarking values (time and CPU usage) for a SARS-CoV-2 dataset analyzed using VGEA.
VGEA rule name Time (h:m:s) Maximum RAM used (MB)
human_reference_index 1:01:53 4688.56
fastp 0:00:14 581.91
bwa_human 0:08:52 5960.95
samtools_extract 0:02:40 16.21
bamtofastq 0:01:39 6.61
aiva 8:19:11 238.57
shiver_init 0:00:53 64.97
shiver_align_contigs 0:04:37 2509.64
shiver_map_reads 0:31:51 567.27
shiver_tidy 0:00:00 1.06
quast 0:00:33 72.51
DOI: 10.7717/peerj.12129/table-1

Notes:

IVA was run using one CPU core and two threads so if allowed more computational resources, the assembly time will be even shorter.

We compared the contigs generated by VGEA’s assembly step with contigs generated using two other standalone and commonly used assembly pipelines, SPAdes (Bankevich et al., 2012) and Velvet (Zerbino & Birney, 2008). We compared against these two pipelines because most commonly used assembly workflows like viral-ngs and VirAmp are built on them. We carried out this comparison by making use of five different SARS-CoV-2 test datasets (namely CV18, CV29, CV45, CV115 and CV145 datasets available on FigShare and NCBI). We compared the assemblies to the SARS-CoV-2 reference genome, and N50/NG50, mis-assembly, mismatches and indel scores were used to evaluate the performance of each assembly method as recommended by Assemblathon 2 (Bradnam et al., 2013) (Table 2). Basic statistics were calculated using QUAST. All results of our performance evaluation and comparison are provided as File S2. All analyses were run on a 64-bit personal computer with 16GB RAM using four threads. SPAdes version 3.15.2 and Velvet version 1.2.10 were used for the comparison purposes using the default parameters.

Table 2:
Performance comparison using different assembly pipelines.
Sample ID # reads
(x106)
Pipeline # contigs Largest contig (bp) N50 NG50 Genome fraction (%) Mis assemblies Mismatches Indels Maximum RAM used (MB)
CV18 3.2 VGEA
SPAdes
Velvet
42
384
68
29928
22141
1858
2294
1435
728
29928
22141
922
99.776
99.652
19.326
0
1
0
10
18
3
0
1
0
627
2447
1544
CV29 1.8 VGEA
SPAdes
Velvet
31
478
66
7731
24904
2877
3065
1136
942
7534
24904
1380
99.786
99.632
1.729
0
0
0
9
7
0
0
0
0
484
2314
807
CV45 6.2 VGEA
SPAdes
Velvet
30
45
535
16248
6779
5239
2603
1255
898
16248
2447
3030
98.291
94.957
14.256
1
0
0
11
35
0
0
12
0
666
2504
1360
CV115 2 VGEA
aSPAdes
Velvet
28
49
41
5225
1942
2847
2258
1068
819
3060
1828
931
96.957
-
68.134
0

0
12

9
0

0
177
1735
511
CV145 4.4 VGEA
SPAdes
Velvet
28
188
178
6807
3216
1798
2049
1190
682
4214
2477
1107
73.093
5.073
3.578
0
2
0
14
13
0
0
0
0
635
2547
1459
DOI: 10.7717/peerj.12129/table-2

Notes:

QUAST gave no genome fraction value for this sample.

Evaluation statistics showed that contigs generated by VGEA had the highest NG50 score for four of the five datasets and the highest N50 scores across all five datasets. In all five datasets, VGEA’s contigs had the highest genome fraction covering greater than 95% in four.

Comparison of maximum RAM used by VGEA, SPAdes and Velvet showed that VGEA used the least amount of RAM for the analyses of all five datasets used for comparison. SPAdes and Velvet however ran faster than VGEA for all analyses.

Discussion

VGEA is built on the snakemake workflow management system (BKöster & Rahmann, 2012), a workflow management system that allows the effortless deployment and execution of complex distributed computational workflows in any UNIX-based system, from local machines to high-performance computing clusters. It is a user-friendly, customizable and reproducible pipeline which can be deployed on a personal computer and which can run from start to finish with a single command.

VGEA was designed with ease-of-use in mind and so all its dependencies can be installed in a conda environment under the bioconda channel (Grüning et al., 2018 making it particularly useful for scientists with little or no computational background and for scientists in LMICs who don’t have much access to high-performance computing clusters or cloud-computing resources. VGEA capitalizes on Snakemake’s multi-threading feature so that makes it possible for it to be deployed on laptops with greater computing performance or a computing server to improve its speed. The pipeline was tested with paired-end short-read sequencing data produced by the illumina platform (MiSeq, MiSeq FGx and HiSeq 2500).

The results generated by the major steps of the VGEA pipeline are summed up together into a MultiQC report which can be easily interpreted and understood by anyone with little or no knowledge of bioinformatics.

Conclusion

VGEA was built primarily by biologists and in a manner that is easy to be employed by users without significant computational background. As new and innovative tools for viral genome analysis and assembly are increasingly being developed, these can easily be incorporated into the VGEA pipeline. We hope that other scientists can build upon and improve VGEA as a tool to extract more qualitative and quantitative information from viral genomes.

Supplemental Information

The shiver Method in More Detail (from (Wymant et al., 2018)

DOI: 10.7717/peerj.12129/supp-1

Performance Evaluation and Assembly Pipelines Comparison Data

DOI: 10.7717/peerj.12129/supp-2
2 Citations   Views   Downloads