High-performance pipeline for MutMap and QTL-seq

View article
Bioinformatics and Genomics

Introduction

Bulked segregant analysis (Michelmore, Paran & Kesseli, 1991; Giovannoni et al., 1991; Li & Xu, 2021), as implemented in MutMap (Abe et al., 2012) and QTL-seq (Takagi et al., 2013), is a powerful and efficient method to identify loci contributing to important phenotypic traits. MutMap requires whole-genome resequencing of a single individual from the original cultivar and the pooled sequences of F2 progeny from a cross between the original cultivar and mutant. MutMap uses the sequence of the original cultivar to polarize the site frequencies of neighboring markers and identifies loci with an unexpected site frequency, simulating the genotype of F2 progeny.

QTL-seq was adapted from MutMap to identify single or multiple loci contributing to important phenotypic traits. It utilizes sequences pooled from two segregating progeny populations with extreme opposite traits (e.g., resistant vs. susceptible to a pathogen) and single whole-genome resequencing of either of the parental cultivars. The original QTL-seq algorithm assumes that loci controlling phenotypic traits fix in opposite directions in two bulked populations through self-fertilizing. Therefore, QTL-seq is usually applied to homozygous genomes of the self-fertilizing plant but not to heterozygous genomes obligated to outcross.

Despite their usefulness, these programs are not user-friendly to install or run and require multiple user inputs. Another problem is that the programs requires Coval (Kosugi et al., 2013) for variant calling, which relies on the older versions of SAMtools (before 0.1.8). Updated software including PyBSASeq (Zhang & Panthee, 2020) and QTL-seqr (Mansfeld & Grumet, 2018) have been developed (Li & Xu, 2021).

In this study, we describe newly developed pipelines for MutMap and QTL-seq with updated features.

Implementation

The new pipelines support read trimming by Trimmomatic (Bolger, Lohse & Usadel, 2014), replacing fastx-toolkit in the previous pipeline. Trimmed reads are aligned by BWA-MEM (Li & Durbin, 2009), replacing BWA-SAMPE, BWA-ALN and Coval. Improperly paired reads and PCR duplicates are filtered by SAMtools (Li et al., 2009). Subsequently, a VCF file is generated by the “mpileup” command implemented in BCFtools (Li, 2011). The user can start the analysis from any point in the process, e.g., from raw FASTQs, trimmed FASTQs, BAM files, or a VCF file. MutPlot and QTL-plot, which are standalone programs, were developed for postprocessing of VCF files. Low-quality variants in a VCF file are filtered out based on mapping quality and strand bias and the actual and expected SNP-indices calculated based on the AD (allele depth) value of each sample pool (Abe et al., 2012). In QTL-seq, a ΔSNP-index is calculated by subtracting the SNP-index of one bulk from the other (Takagi et al., 2013). As an option, multiple testing correction (Huang et al., 2020) was also adopted to the simulation. Both pipelines ignore the SNPs which are missing in the parental sample. Candidate causal mutations in the VCF file are shown graphically after optionally executing SnpEff (Cingolani et al., 2012), which assesses the impact of located mutations on putatively expressed genes. The procedures are connected by a Python script.

Methods

To compare the performance of the new and old pipelines, we ran MutMap and QTL-seq on an AMD EPYC 7501 processor (Base 2.0 GHz) with 48 GB RAM and 12 threads (located at ROIS National Institute of Genetics in Japan). MutMap was run for two datasets, dataset 1 and dataset 2, used in the previous research (Abe et al., 2012). The original rice cultivar of both datasets was Hitomebore. The mutant bulks for dataset 1 and dataset 2 were Hit1917-pl and Hit1917-sd, respectively. These datasets can be downloaded as DRR004451 (Hitomebore), DRR001785 (Hit1917-pl), and DRR001787 (Hit1917-sd). MutMap v2.3.2 was run with the option “-n 20” as both mutant bulks contain 20 lines. The other parameters of MutMap v2.3.2 were set as default. For both datasets, “IRGSP-1.0” was used as the reference genome.

QTL-seq was run for the two datasets, dataset 3 and dataset 4, used in the previous study (Takagi et al., 2013). Dataset 3 was obtained from recombinant inbred lines (RILs) derived from a cross between Hitomebore and Nortai. Dataset 4 was obtained from F2 progeny derived from a cross between Hitomebore and WRC57. We used a rice cultivar Hitomebore as a parental sequence for both datasets. These datasets can be downloaded as DRR004451 (Hitomebore), DRR003237 and DRR003238 (RILs derived from F1 between Hitomebore and Nortai), and DRR003341 and DRR003342 (F2 progeny derived from F1 between Hitomebore and WRC57). For dataset 3, we ran QTL-seq v2.2.2 with the options “-n1 20 -n2 20 -F 6” because both bulks contain 20 F6 RILs. For dataset 4, we ran QTL-seq v2.2.2 with the option “-n1 50 -n2 50 -F 2” as both bulks contain 50 F2 progeny. The remaining parameters of QTL-seq v2.2.2 were set to their default values. For both datasets, “IRGSP-1.0” was used as the reference genome.

Pipeline workflow and performance of MutMap and QTL-seq.

Figure 1: Pipeline workflow and performance of MutMap and QTL-seq.

(A) The pipeline workflow of MutMap. (B) The pipeline workflow of QTL-seq. (C) Speed comparison between the new (v2.3.2) and old (v1.4.5) pipelines of MutMap. (D) Speed comparison between the new (v2.2.2) and old (v1.4.5) pipelines of QTL-seq. The values are mean ± SD (n = 3).

Results and Conclusions

The new MutMap and QTL-seq pipelines are approximately 5–8 times faster than the previous pipelines. The significantly reduced processing time of the updated pipelines was accomplished by utilizing more applications with parallel processing (Trimmomatic, SAMtools, and BCFtools) for different steps including SNP calling and by omitting the previously implemented creation of a consensus FASTA file (Fig. 1). The ability of the updated pipeline to use a wider range of input file formats reduced the time required for file-management and data handling and makes the software easier to use. Further time-savings were accomplished with the new pipeline by removing user interactions that were required in the previous version. Although the number of SNPs plotted was slightly different, the results of the old version and the new version were similar or had slightly better confidence index values (Fig. S1).

The simulation-based statistical test was adopted as the default because it allows addressing substantial heterogeneity in read depth among SNPs without any assumptions of statistical distributions of SNP-indices. We also implemented multiple testing correction following the parameters in the previous research (Huang et al., 2020). However, the method described in Huang et al. (2020) requires biological information such as number of chromosomes, genome size, and total centimorgan, which are not available in the majority of organisms, hence severely restricting the applicability. As stated by Li & Xu (2021), the role of bulked segregant analysis is to map the target QTLs as a primary test, regardless of the statistical threshold criteria.

Currently, these new pipelines can be installed through bioconda with all dependencies. The new pipelines of MutMap and QTL-seq have improved performance and are more user-friendly to install and run, making them very useful for the purpose of genetics studies.

Supplemental Information

Comparison of the results between the new and old pipelines for MutMap and QTL-seq

(A) MutMap plot of Hit1917-pl from MutMap v1.4.5. (B) MutMap plot of Hit1917-pl from MutMap v2.3.2. (C) MutMap plot of Hit1917-sd from MutMap v1.4.5. (D) MutMap plot of Hit1917-sd from MutMap v2.3.2. (E) QTL-seq plot of RILs from QTL-seq v1.4.5. (F) QTL-seq plot of RILs from QTL-seq v2.2.2. (G) QTL-seq plot of F2 progeny from QTL-seq v1.4.5. (H) QTL-seq plot of F2 progeny from QTL-seq v2.2.2.

DOI: 10.7717/peerj.13170/supp-1
25 Citations   Views   Downloads