Recently, multiple-trait gene mapping has been widely studied, which refers to analyzing multiple quantitative/qualitative traits simultaneously, considering the correlations among multiple traits. It is necessary to consider the correlations among multiple traits in the process of quantitative trait locus (QTL) mapping because the joint information of multiple traits can improve the accuracy of detecting QTLs that influence the traits of interest. Multiple-trait methods have been validated to be more powerful than single-trait methods (Zhu & Zhang, 2009), and they are being used in many fields such as complex trait analysis and animal or plant genetic breeding.
Genome-wide association studies (GWAS) have become a powerful strategy for exploring candidate genes for important traits, including human complex diseases (Li & Leal, 2008; Maity, Sullivan & Tzeng, 2012; Sainani, 2015; Shi et al., 2017; Storey & Tibshirani, 2003; Wang et al., 2016; Zhang et al., 2017). Currently, several methods for detecting associations between multiple traits and common/rare variants have been proposed. For instance, Zhu, Jiang & Zhang (2012) presented a covariate-adjusted association method based on generalized Kendall’ tau, Zhu & Zhang (2013) proposed a nonparametric regression method for multiple longitudinal phenotypes using multivariate adaptive splines, Zhou et al. (2016) developed a nonparametric method to test for associations between rare variants and multiple traits, Kwak & Pan (2017) proposed gene-and pathway-based association tests for multiple traits with GWAS summary statistics, and Gai & Eskin (2018) presented a meta-analysis method for multiple-trait association analysis.
In addition, linkage analysis methods, such as interval mapping (IM), composite interval mapping (CIM), and multiple-interval mapping (MIM), have also been used to detect QTLs that control single/multiple traits. Linkage mapping methods have their own advantages in cases of both single and multiple traits and different mapping methods depend on their own statistical models. Unfortunately, however, the linkage analysis method for a single trait cannot accurately map the genes that have multiple effects. Jiang & Zeng (1995) proposed the MT-CIM method based on a mixed linear model and maximum likelihood method. This method enables the simultaneous mapping of multiple traits using composite interval mapping. Compared with the single-trait QTL mapping method, it can improve the accuracy and efficiency of QTL mapping. Based on the idea of MIM, using the MT-CIM model, Joehanes (2009) proposed a multi-trait and multi-interval mapping method abbreviated MTMIM and showed that the accuracy of MTMIM is higher than that of MIM and MT-CIM. Recently, Tong, Sun & Zhou (2018) proposed a method (MTMIM-NEW) for simultaneously estimating QTL parameters for mapping multiple traits, which showed higher precision.
All the above-mentioned methods are applicable to cases where there are no genotype errors included in the practical data set. However, due to the constraints in genotype scoring software and biochemical anomalies, most of the data that researchers have used may contain certain potential genotype errors. The genotype errors cannot be neglected arbitrarily in statistical analysis and may have a significant impact on the study of genetic linkage analysis (Ronin et al., 2015; Tong et al., 2015; Yan et al., 2016).
To circumvent this difficulty, in this paper, we proposed a multi-trait multi-interval mapping method in the presence of genotype errors in the genetic data. The closed iteration formulas of QTL effects, QTL recombination rates, the covariance matrix, and the genotype error rate are given in the new method, which guarantees the precision of parameter estimation. To validate the feasibility of the new method, we analyzed mouse high-density lipoprotein cholesterol data with missing genotypes. The results show that the method can effectively solve the problem of incomplete marker genotypes caused by biological or physical deficiencies. Simulation results also show that the new method has advantages in estimating all parameters of interest when marker genotypes contain errors. In particular, compared with existing methods, the new method can provide higher precision in estimating QTL positions (recombination rates).
Theory and methods
We consider the data of n individuals from the F2 population with t phenotypic traits; q + 1 marker loci are closely linked to form q marker intervals. Assuming that Yjl (j = 1, …, n, l = 1, …, t) represents the lth phenotypic trait value of the jth individual, Xji (j = 1, …, n, i = 1, …, q + 1) and denote the true marker genotype and the observed marker genotype, possibly with error of the ith marker of the jth individual, respectively, denotes the latent QTL genotype within the ith marker interval of the jth individual. Let Yj = (Yj1, Yj2, …, Yjt)′, Xj = (Xj1, Xj2, …, Xj(q + 1))′, and . γi and γi1 respectively denote recombination rate of the ith marker interval which is known and the recombination rate between the ith marker and the latent QTL in the marker interval. Assuming that each marker interval of the F2 group has at most one QTL (Zhou, 2010), and when the genotype combination of the ith marker interval of the jth individual is known, the conditional probabilities of QTL genotype are presented in Table 1.
|Code||Marker genotype||QTL genotype|
|1||MiMiMi + 1Mi + 1||1||0||0|
|2||MiMiMi + 1mi + 1||1 − ri||ri||0|
|3||MiMimi + 1mi + 1||(1 − ri)2||2ri(1 − ri)|
|4||MimiMi + 1Mi + 1||ri||1 − ri||0|
|5||MimiMi + 1mi + 1||0||1||0|
|6||Mimimi + 1mi + 1||0||1 − ri||ri|
|7||mimiMi + 1Mi + 1||2ri(1 − ri)||(1 − ri)2|
|8||mimiMi + 1mi + 1||0||ri||1 − ri|
|9||mimimi + 1mi + 1||0||0||1|
ri = γi1/γi.
Here, we assume that parameter denotes the parameter of the genotype error rate for any genotype x, and further assume that φj denotes the joint error rate of the jth individual. During each step of the iteration computation in our new method, when the true marker genotype Xj is given, we can calculate the number kj of false genes coded at the q + 1 marker loci. Assuming that whether a marker genotype has a genotyping error or not is independent of other marker genotypes, we can obtain
The joint error rates φj(j = 1, …, n) can be used to infer the parameter θ of the genotype error rate in the new method.
In the framework of multiple-interval mapping, to solve the problem of multiple-trait analysis with genotype errors, we consider building a statistical model as follows:
(1) where is the phenotype value matrix with element Yji, and denotes the indicator vector of the ith QTL genotypes of n individuals, respectively. In detail,
and denote the additive effect vector and dominant effect vector of the ith QTL on t traits, respectively.
Furthermore, let the QTL effect matrix be
represents the residual matrix, , and E(ej) = 0, cov(ej) = e. For convenience, we denote the covariance matrix as:
where , for any individual, j.
To solve the problem of multiple-trait, multiple-interval mapping with genotype errors, based on Model (1), we next provide a detailed derivation process of the new method via the classical expectation-maximization (EM) algorithm (Dempster, Laird & Rubin, 1977). Let Ω = (C, ∑e, γ, θ) denote all parameters of interest, where γ = (γ11, γ21, …, γq1) is the parameter vector of QTL positions. The new method has the advantage of simultaneously estimating all model parameters in the framework of multiple-interval mapping.
Parameter estimation via the EM algorithm
In our new method, we focus on the observed data , where , Y = (Y1,Y2, …., Yn)′, and deeply mine the generating mechanism of . At the same time, we need to comply with the statistical model consisting of multiple traits and the latent QTLs given in Model (1). Therefore, based on the logistic structure of all random variables, as well as the background of genetic linkage, the EM algorithm can be applied to infer all parameters of interest here.
The complete log-likelihood function of the parameter matrix Ω can be expressed as follows:
E−Step: Given the observed data and the sth-step parameter values (s), we calculate the conditional expectation of l( ).
Here, , , and . According to the Bayes formula, can be calculated and expressed in the following form:
Further, we let
(3) where represents the conditional probability that takes the kth value under the given condition of . In a detailed analysis, , which is a multivariate normal density with mean μjk and covariance matrix e. can be expressed by a function of conditional probabilities listed in Table 1, because each QTL genotype is conditionally independent, given the condition of marker genotypes. In detail, is the function of recombination rates γ11, γ21, …, and γq1.
M−Step: Calculate the maximum value of the conditional expectation , in order to obtain Ω(s + 1).
Updating C and e
Based on the first part of in Eq. (2), the iteration expression of the QTL effect matrix C and covariance matrix e can be obtained as follows:
where is a n × 3q matrix with element given in Eq. (3), G is the QTL genotype sequence, and D = (D1, D2, D3, …, Dcq) is a 3q × cq matrix with expression
The formulae of R(s) and M(s) can be expressed as
where notation “#” in expressions R(s) and M(s) denotes the Hadamard product of two vectors and 1 is an n × 1 column vector of ones.
We define that the indicator function, , of the QTL genotype and the marker-interval genotype have the following form:
Here, j = 1, …, n, i = 1, …, q. According to the conditional independence assumption of QTL genotypes, given marker-interval genotypes, the second part of in Eq. (2) can be transformed into the following form:
Hereinto, can be expressed as a function of the probabilities listed in Table 1 and the indicator function Ist(ji)s. When we maximize the above Eq. (4), an explicit closed expression of the recombination rate γi1 can be obtained:
As for the third part of in Eq. (2),
and maximization of this function could lead to an explicit expression of the error rate, θ:
By repeating the above entire updating process for C, e, γ, and θ until convergence, the parameter estimate of can eventually be obtained. For convenience, this proposed method for mapping QTLs of multiple traits is referred to as MTMIM_e.
In our study, we suppose that the overall genotype error rate is in a given data set, which means that the genotype error rates for each locus are same. When considering multiple datasets with same error rate (e.g., the genotype data in each dataset were obtained under the same environment), we can combine these datasets into one big dataset. At this time, the total error rate is still the fixed parameter, so the combined dataset can be regarded as a new dataset, and then can be analyzed by the method provided in this paper. All parameters including the genotype error rate can be estimated, and moreover, the accuracy of parameter estimation will be improved due to the increase of sample size. If these datasets have different genotype error rates (e.g., θ1, θ2, θ3), we can also combine these data sets into one big dataset, but the estimating strategy in our method need to be adjusted. The parameters C, e, and γ can be estimated by the whole data, but the genotype error rates need to be estimated by their respective dataset.
To evaluate the performance of the proposed MTMIM_e method in this study and objectively compare it with the existing methods, MTMIM (Joehanes, 2009) and MTMIM-NEW (Tong, Sun & Zhou, 2018), extensive simulation studies were conducted in this section.
In our simulation design, two quantitative traits and two latent causal QTLs were considered. The two QTLs were respectively located in two equally spaced marker intervals that consisted of three markers on a chromosome. We randomly set certain genotype errors at the first marker in error rate θ. The length of the marker interval was taken as ten cM, and the sample size was N = 500. A map distance of ten cM corresponds to a recombination rate of 0.0906 according to the Haldane map function (Ott, 1999). To study the impact of genotype errors on the estimates of the recombination rate γ, the effect matrix C, and the covariance matrix ∑e, simulated data were generated under different error rates (θ = 0, 0.05, and 0.1); the true values of recombination rates were taken as γ11 = 0.03 and γ21 = 0.04, and the effect matrix C was selected to make the heritability h2 close to 0.05 and 0.2, respectively. For each group of parameters, we computed the estimates of all parameters using the above three methods. The entire process was repeated 1,000 times and the mean of the estimates for each parameter was computed. To evaluate the accuracy of the estimates obtained by the different methods, the mean square error (MSE) of each parameter estimate and the total mean (TM) of MSEs of all parameter estimates (except θ) were also calculated (Tables 2–4; Figs. 1, 2).
|True||= 0||= 0.05||= 0.1|
|Parameter||h2 = 0.05||h2 = 0.2||MTMIM||MTMIM-NEW||MTMIM_e||MTMIM||MTMIM-NEW||MTMIM_e|
|Heritability||Trait||QTL||Effect parameter||True value of error rate θ|
|h2 = 0.2||T1||Q1||−0.1478||0.1182||0.0106||0.0585||0.1013|
|h2 = 0.05||T1||Q1||−0.0678||0.0542||0.0306||0.0580||0.1019|
Table 2 provides the estimates of all parameters (except θ) and the MSEs with different error rates when heritability h2 = 0.2. Simulation results show that genotype error is an important factor affecting parameter estimation; larger genotype error rates would result in lower estimation accuracy. To see the performance of the three methods more intuitively, we present the TM histograms of the three methods (Fig. 1). The gray bar graph shows the TM values of the newly proposed method under different error rates. In the case of the same genotype error rate (θ = 0.05, 0.1), it can be seen from Fig. 1 that each value of the TM of the proposed method (MTMIM_e) is uniformly lower than the corresponding values of MTMIM-NEW and MTMIM. When the genotype error rate θ = 0 (i.e., no genotype error in the simulated data set), the performance of the new method is also comparable to that of the MTMIM-NEW method and better than that of the original MTMIM method. That is, the new method outperforms the other two methods in terms of estimating accuracy, and it can effectively overcome the influence of the genotype errors on gene mapping.
We also consider the impact of different heritabilities on the estimation accuracy. Table 3 gives the estimates of all parameters (except θ) and their MSEs, with different heritability values when the genotype error rate θ = 0.05. The simulation results showed that, with an increase in heritability, the accuracy of these three methods also increases as expected; however, among all these methods, the new method still has the highest estimation accuracy. Figure 2 shows the TM histogram of the three methods under different heritability (h2 = 0.05, 0.2) conditions. The TM value of the new method (MTMIM_e) is smaller than that of the other two methods; in other words, the new method can weaken the influence of genotype error on parameter estimation (Fig. 2).
In addition to the estimates of QTL effects, QTL positions, and the covariance matrix of multiple traits, the estimate of the genotype error rate can also be obtained simultaneously using the new method. Table 4 shows the estimates of error parameter θ and the corresponding MSEs with true values of genotype error rate θ = 0, 0.05, 0.1, when the heritability is h2 = 0.05, 0.2. The estimation accuracy of error rate θ increases with an increase in heritability h2, as expected, and the estimation accuracy decreases with an increase in the genotype error rate (Table 4). This is because simulated data with higher genotype error rates provide more uncertainty to the practical inference, relative to the cases with lower genotype errors or no genotype error.
In order to further evaluate the performance of the proposed method on analyzing multiple datasets, we conducted another simulation study. A total of three datasets with sample sizes N1 = 200, N2 = 300, and N3 = 500, and genotype error rates θ1 = 0.01, θ2 = 0.02, and θ3 = 0.05 were simulated under two heritabilities (see Table 5 for the true values of parameters), and the MTMIM_e method was used to estimate all parameters. The means of the estimates for each parameter over 1,000 replications and the corresponding MSEs were provided in Table 5. From the results, we conclude that the new method can effectively and simultaneously analyze multiple datasets with potential genotype errors.
|Effect parameter||Estimates||γ11||γ21||ρ||σ11||σ21||Error rate θ|
|Heritability||Trait||QTL||Additive||Dominant||Additive||Dominant||0.03||0.04||0.9||1||1||θ1 = 0.01||θ2 = 0.02||θ3 = 0.05|
|h2 = 0.2||T1||Q1||−0.1478||0.1182||−0.1457||0.1013||0.0269||0.0379||0.8844||0.9822||0.9822||0.0189||0.0301||0.0397|
|h2 = 0.05||T1||Q1||−0.0678||0.0542||−0.0725||0.0492||0.0300||0.0413||0.8798||0.9779||0.9788||0.0263||0.0378||0.0374|
Statistical analysis of real data
In this section, we further verified the feasibility of the proposed method in the mapping of multiple-trait loci, and an experimental data set on high-density lipoprotein cholesterol levels in mice in the published literature was used for the analysis (Stylianou et al., 2006). High-density lipoprotein (HDL) and body weight (BW) traits were selected for data analysis. We chose three candidate markers (D08.0989, D08.1099, and D08.1238) to perform multiple-interval mapping, which were located at 49.225, 54.685, and 61.605 cM of chromosome 8, respectively. These markers construct two marker intervals. In fact, there are missing values in the genotype data of the first marker of the considered dataset. For illustration, the missing values were imputed by the corresponding modes of genotypes on the locus, and then the new genotype data became observed data but with genotype errors.
The proposed method, MTMIM_e, in which genotype errors are considered, and the other two methods, MTMIM and MTMIM-NEW, in which genotype errors are neglected, were all used to deal with the imputed data set (Table 6). All the estimated results of QTL positions and effects obtained from the three methods show that QTL 1 located at ∼50 cM and QTL 2 located at ∼57 cM of chromosome 8 had significant effects on the two traits, HDL and BW. Moreover, QTL 1 and QTL 2 showed additive/dominant effects on the two traits in the same direction for all three methods. A small difference is that the MTMIM method gave a positive dominant effect estimate on QTL 1, but the other two methods obtained a negative estimate value of the dominant effect of this QTL. These conclusions are similar to those drawn by Stylianou et al. (2006); however, the estimated results of MTMIM-NEW and MTMIM_e seem much closer to those obtained by Stylianou et al. (2006). In addition, the proposed MTMIM_e can simultaneously estimate the genotype error rate θ of the imputed data set, and the estimated value is , which is close to the missing proportion of genotypes on the first marker locus.
In this paper, a new multiple-trait, multiple-interval mapping method (MTMIM_e) was proposed to deal with multiple-trait genetic data with potential genotype errors, owing to the fact that most genetic datasets may contain certain genotype measurement errors. In fact, the proposed method can also be used to analyze a dataset that contains missing values. In this case, one needs to impute the missing values by statistical methods first, and then consider the imputed missing values as data with genotyping errors, so that the MTMIM_e can be applied, as shown in our analysis of a real example. Extensive simulation studies have validated that the new method is advantageous for parameter estimation in the QTL mapping of multiple traits and it can effectively overcome the impact of genotype errors/loss and estimate all parameters simultaneously, compared with existing methods. The proposed method can also be used to deal with the problems of other cases, for example, other experimental populations, or different error rates on different loci, in which we only need to change the conditional probabilities provided in Table 1, or adjust the presentation of the joint error rate φj (Tong et al., 2015), respectively. The statistical Model (1) considered in this study can also be further generalized; for example, genotype information of markers can be added and considered simultaneously, which will lead to a composite multi-interval mapping model.
After obtaining an estimate of Ω, we can further consider whether there are significant QTLs in the considered marker intervals that control multiple traits. A global test using the likelihood-ratio (or log10 of odds; LOD score) statistic can be performed first to test two hypotheses of the effect matrix C, i.e., H0:C = 0 vs. and furthermore, a local test can be conducted to test two hypotheses about the specific QTL effect, i.e., ; , if the global test is significant. It is considered that there exists a significant QTL effect on trait l in the considered marker interval if is rejected.
Although the new method has several advantages, there are still aspects that can be improved. One disadvantage of the new method is that the amount of computation increases as the number of markers (intervals) or traits increases. For the computation of one simulated data containing 500 individuals (considering three maker loci and two phenotypes), the EM algorithm takes about 300 iterations to converge in the Windows System with Intel Core i5-10210U 2.11 GHz processor and 16 GB memory, and the memory consumption is about 500 MB. Phenotype amount has a little impact on memory consumption, but the most significant impact factor is the number of marker loci. When dealing with case of multiple real QTLs (or multiple markers), we suggest that two marker intervals constructed by three markers are analyzed by our method at one time. Performing this operation until all markers are completely analyzed will save much running time of the algorithm. Alternatively, we suggest using the idea of a two-step mapping method (Tong et al., 2015), i.e., first detect and retain the markers with larger effects of all the markers. Marker intervals can then be constructed using the selected markers so that the proposed method in this paper can be used with less computational burden. The problem of gene mapping is very important for human disease research, as well as for animal and plant genetic breeding; however, incomplete marker genotypes caused by biological or physical deletions are inevitable. In the future, we will focus on improving the performance of the proposed method in this paper, so that it can adapt better to more complex cases.