The availability of massively parallel transcriptome sequencing technology has improved our understanding of the central dogma processes of transcription and translation, as well as the molecular mechanisms of complex diseases (Koboldt et al., 2013; Modelska, Quattrone & Re, 2015; Wang, Gerstein & Snyder, 2009). RNA sequencing (RNA-Seq) has shown that coding genes are stochastically spliced into different transcripts (called isoforms) that retain partial or selected exons (Kalsotra & Cooper, 2011; Kanitz et al., 2015; Oshlack, Robinson & Young, 2010; Pan et al., 2008). This process is referred to as alternative splicing (AS). Isoforms with different sequence characteristics may exhibit different expression patterns, and these differences further influence protein translation and affect cellular phenotypes (Li et al., 2014).
A fundamental problem in transcriptomic data analysis is the accurate identification of differentially expressed (DE) genes (Garber et al., 2011). Numerous software packages and pipelines have been designed for such analyses, including methods that rely on gene-level measurements, such as DESeq, edgeR and the two-stage Poisson model (TSPM), and others based on isoform-level measurements, such as Cuffdiff2, IsoDE and EBSeq (Al Seesi et al., 2014; Anders & Huber, 2010; Lehmann et al., 2011; Li & Tibshirani, 2013; Robinson, McCarthy & Smyth, 2010; Trapnell et al., 2013). Gene-level quantification methods assume that the sum of isoform expression levels constitutes the expression of one gene, which may ignore the variability among transcripts (Trapnell et al., 2013; Trapnell et al., 2010). Thus, methods that exploit isoform-level measurements are of increasing importance. For example, considering both the splicing structure and the expression levels of isoforms, Cuffdiff2 uses the beta negative binomial distribution to define DE isoforms (Trapnell et al., 2013). Unfortunately, testing individual isoforms separately neglects the relationships among isoforms (Dialsingh, Austin & Altman, 2015).
Currently, the idea of a set test, which groups related variables into one set and tests the set jointly, may overcome both the drawbacks of gene-level measurement methods and the disadvantage of testing isoforms individually. The theoretical basis is that the score statistic of the variance component in a generalized linear mixed model (GLMM) follows a mixture of chi-squared distributions under the null hypothesis (Lin, 1997). The sequence kernel association test (SKAT), a popular algorithm in genome-wide association studies (GWASs), treats the effects of loci within the same functional region as random effects and uses the score statistic to test the set (Wu et al., 2011). Since estimation is performed under the null hypothesis, this method reduces computation time. The test for the effect of a gene set (TEGS), which regresses gene expression against phenotypes, is widely used to select DE genes on microarray platforms; in this model, the random effect captures the disease effect and the residual error captures the correlation between genes (Huang & Lin, 2013). The set test has also been applied effectively to other types of high-dimensional genomic data (Ionita-Laza et al., 2013; Wu et al., 2016).
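The distributional fact underlying the set test can be stated explicitly; the following LaTeX sketch gives the Lin (1997) result, with the weights λi (notation assumed here for illustration) determined in practice by the eigenvalues of the kernel-like matrix appearing in the score statistic:

```latex
% Under H_0 : \tau = 0, the variance-component score statistic follows
% a mixture of independent one-degree-of-freedom chi-squared variables:
U \;\sim\; \sum_{i} \lambda_i \, \chi^2_{1,i} \quad \text{under } H_0 ,
% where the mixture weights \lambda_i are obtained from the eigenvalues
% of the kernel-like matrix in the quadratic form of the statistic.
```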
Considering the disadvantages of traditional algorithms in selecting heterogeneous genes and the applicability of the set test, we apply the idea of the set test to identify DE genes from isoform-level measurements. This study involves three parts. First, the relationship between the expression level of one gene and the phenotype is formulated as a GLMM under Poisson and negative binomial (NB) distributional assumptions. Second, we study the empirical and theoretical distributions of the score statistic and measure its statistical performance under different parameter settings. Third, we analyze level 3 mRNA-Seq data on lung squamous cell carcinoma (LUSC) from The Cancer Genome Atlas (TCGA). Comparisons with traditional algorithms are described in the ‘Simulations’ and ‘Real data analysis’ sections.
Assume that there are N subjects in the studied sample, and suppose that for all individuals there is one sequenced gene with p isoforms. For the ith individual, Yi1, Yi2, …, Yip denote the RNA-Seq counts of that gene. The vector Y is assumed to follow a Poisson or NB distribution. Under the assumption of independent samples, xi is the covariate of individual i; for example, its value is 0 or 1 in a case-control study. We assume αj to be a random term that follows a normal distribution, whose variance represents the heterogeneity of the baseline abundance of the isoforms. This assumption distinguishes isoform-level measurement methods from gene-level measurement methods. The disease effect is also defined as a random effect. Based on the above assumptions, a GLMM with two random terms is constructed as follows: g(E(Yij)) = μ + αj + xiβj, where g(·) is a monotonic differentiable link function, such as log(·) for Poisson regression. If xi is equal to zero in a case-control study, μ + αj indicates the expression of the jth isoform in the control group, and βj indicates the disease effect.
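The data-generating process above can be sketched in a few lines; this is an illustrative simulation (not the authors' code), with variance names s and tau assumed for Var(αj) and Var(βj):

```python
import numpy as np

# Illustrative sketch: generate RNA-Seq counts for one gene with p isoforms
# under the Poisson form of the model  g(E[Y_ij]) = mu + alpha_j + x_i*beta_j,
# with g = log. Parameter names s and tau are assumptions for illustration.
rng = np.random.default_rng(1)

N, p = 40, 4          # subjects, isoforms
mu = 2.5              # baseline log expression
s, tau = 0.5, 0.4     # Var(alpha_j) and Var(beta_j)
x = np.repeat([0, 1], N // 2)            # case-control covariate
alpha = rng.normal(0.0, np.sqrt(s), p)   # isoform-specific baselines
beta = rng.normal(0.0, np.sqrt(tau), p)  # isoform-specific disease effects

# N x p count matrix: row i holds (Y_i1, ..., Y_ip)
lam = np.exp(mu + alpha[None, :] + x[:, None] * beta[None, :])
Y = rng.poisson(lam)
print(Y.shape)        # (40, 4)
```

Setting tau = 0 removes the disease effect and corresponds to the null hypothesis tested below.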
The matrix form is written as g(E(Y)) = μ + Zα + Xβ, where Y is an Np × 1 vector obtained by stacking the N vectors Yi = (Yi1, Yi2, …, Yip)T; μ is an Np × 1 vector in which all elements are μ; Z = 1N ⊗ Ip is an Np × p matrix, in which Ip is the p-dimensional identity matrix; X = x ⊗ Ip; α = (α1, …, αp)T; and β = (β1, …, βp)T.
Theoretical null distributions of statistics
In fact, testing the disease effect amounts to testing the variance component of the second random effect. Under the basic assumption of mixed models, the third term in the above equation involves a normally distributed random effect, β ∼ N(0, τIp) (Gad & El Kholy, 2012). Therefore, testing β is equivalent to testing its variance component τ, and the null hypothesis (H0) of the model is written as τ = 0. A score test is used to test τ. Based on the theory of score statistics in a GLMM and the specific assumptions of this model, the null distribution and parameter estimation of the statistic U are as follows:
First, the matrix form of the first-order derivative of the log-likelihood is as follows: where Δ−1, W and W0 are block diagonal matrices, and each block of the three matrices is a diagonal matrix whose diagonal elements share a common form. With different variances but the same link function, Δ−1 is the same under the Poisson and NB distributions, whereas W and W0 differ. Under the Poisson assumption, the product of Δ−1 and W is an identity matrix, and W = W0. Under the NB assumption, we denote the product of Δ−1 and W by Φ, where ϕ is the scale parameter of the NB distribution, and the diagonal elements of the blocks of W0 take a corresponding NB-specific form.
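As a concrete sketch of the Poisson case (using the common GLMM convention Δ = diag(∂μij/∂ηij), which is a notational assumption here): with the log link, ∂μij/∂ηij = μij, and the Poisson variance function equals μij, giving

```latex
% Poisson with log link: both the derivative dmu/deta and the variance
% function equal mu_ij, so
\Delta = \mathrm{diag}(\mu_{ij}), \qquad W = \mathrm{diag}(\mu_{ij})
\;\Longrightarrow\; \Delta^{-1} W = I, \qquad W = W_0 ,
```

which matches the statement above that Δ−1W is the identity matrix and W = W0 under the Poisson assumption.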
Second, the statistic U depends on the first derivative of the log-likelihood function, and its formula depends on the distributional assumption: U takes one form under the Poisson assumption and another under the NB assumption. The nuisance parameters μ and α of the model are estimated under H0.
Third, the chi-square statistic is obtained by standardizing U with the efficient information of τ, computed from the blocks Iττ, Iτθ and Iθθ of the information matrix, where θ denotes the nuisance parameters estimated under H0. The formula is then rewritten in terms of a matrix K.
Finally, suppose A = KTK with diagonal elements aii, which form a vector. Let κ2i, κ3i and κ4i be the second, third and fourth cumulants of Yi. Under the assumption of an exponential family of distributions, the relationships among the three cumulants are as follows:
Let R ∈ ℝNp×Np be a block diagonal matrix, with the diagonal and off-diagonal elements of each block computed from the cumulants above. C ∈ ℝNp×Np is composed of p diagonal blocks Cj. The elements of the information matrix then follow, where J is an Np × 1 vector whose elements are all 1. The difference in variances leads to different expressions for R and C: the diagonal and off-diagonal elements of R, and the diagonal elements of C, take different forms under the Poisson and NB assumptions.
Empirical distribution of statistics
A permutation algorithm generates the empirical distribution. The two distributional assumptions yield different score statistics, one under the Poisson assumption and one under the NB assumption. Hypothesis testing with U follows two steps: (a) construction of the empirical distribution by shuffling the phenotype labels and (b) calculation of the P value of the original statistic U. The estimation of μ and α is the same as above. To improve robustness, the default number of permutations is 5,000.
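The two-step permutation scheme can be sketched as follows; the actual score statistic is replaced by a generic `stat` callable, since its exact formula depends on the distributional assumption (the toy statistic at the end is purely illustrative):

```python
import numpy as np

def perm_pvalue(y, x, stat, n_perm=5000, seed=0):
    """Empirical P value for a score-type statistic via label shuffling.

    y    : (N, p) count matrix for one gene
    x    : (N,) case-control labels
    stat : callable (y, x) -> float; placeholder for the score statistic U
    """
    rng = np.random.default_rng(seed)
    u_obs = stat(y, x)
    # (a) build the empirical null distribution by permuting the labels
    u_null = np.array([stat(y, rng.permutation(x)) for _ in range(n_perm)])
    # (b) P value: proportion of permuted statistics at least as extreme
    return (1 + np.sum(u_null >= u_obs)) / (1 + n_perm)

# Toy illustration with a crude between-group mean difference as the statistic:
rng = np.random.default_rng(2)
x = np.repeat([0, 1], 20)
y = rng.poisson(5.0, size=(40, 3))
p_val = perm_pvalue(y, x,
                    lambda y, x: abs(y[x == 1].mean() - y[x == 0].mean()),
                    n_perm=200)
print(0.0 < p_val <= 1.0)   # True
```

Adding 1 to both the numerator and the denominator keeps the P value strictly positive, a common convention for permutation tests.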
We perform simulations to examine the type I error and power of the proposed score statistic, U, for identifying differential expression under a range of scenarios. The parameter settings are based on the analysis of real data and the literature (Lehmann et al., 2011). We assume that the RNA-Seq data follow NB or Poisson distributions. Five parameters are involved: the variance of α (s), the variance of the disease effect (l), the number of isoforms (p), the logarithm of the expression level (M) and the dispersion parameter (ϕ). s reflects the heterogeneity of the baseline abundance of each isoform, and l reflects the heterogeneity of the disease effect; the other three parameters describe the characteristics of the expression levels. The variance of α varies from 0.25 to 1.75 in steps of 0.75. The variance of β varies from 0.20 to 1.00 in steps of 0.40. The number of isoforms is 2, 4 or 8. ϕ is set at 0.5 or 2, and M is set at 2.5 or 5. The sample size is fixed at 40. In the type I error simulations, the effect size is set to zero (l = 0).
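One simulation replicate under the NB assumption can be sketched as below (illustrative, not the authors' R code). NB counts with mean μ and dispersion ϕ (variance μ + μ²/ϕ) can be drawn with NumPy's (n, p) parametrization via n = ϕ and p = ϕ/(ϕ + μ):

```python
import numpy as np

# One simulated dataset under the NB assumption. Parameter names follow the
# text: s = Var(alpha), l = Var(beta), p_iso = number of isoforms,
# M = log expression level, phi = dispersion parameter.
rng = np.random.default_rng(3)

N, p_iso = 40, 4
s, l, M, phi = 0.25, 0.6, 2.5, 2.0
x = np.repeat([0, 1], N // 2)

alpha = rng.normal(0.0, np.sqrt(s), p_iso)
beta = rng.normal(0.0, np.sqrt(l), p_iso)   # set l = 0 for type I error runs
mu = np.exp(M + alpha[None, :] + x[:, None] * beta[None, :])

# NB with mean mu and dispersion phi: variance = mu + mu**2 / phi
Y = rng.negative_binomial(n=phi, p=phi / (phi + mu))
print(Y.shape)   # (40, 4)
```

Looping this replicate over the grid of (s, l, p, M, ϕ) values and counting rejections gives the empirical type I error (l = 0) and power (l > 0).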
We compare our approach with three traditional algorithms: DESeq, edgeR and TSPM (Anders & Huber, 2010; Lehmann et al., 2011; Robinson, McCarthy & Smyth, 2010). In these methods, the sum of the isoform counts is taken to represent the expression level of one gene. Therefore, the number of simulation replicates corresponds to the number of genes. When l is equal to 0, the proportion of significant genes is the false positive rate.
Real data analysis
We download the LUSC Batch 101 mRNA sequencing data from TCGA (Network, 2012). The sample size is 12 for the case-control traits, and the number of genes is 20,533. Each sample has four files: gene expression, normalized gene expression, isoform expression and normalized isoform expression; we use only the gene expression and isoform expression datasets. These data are attached as File S1. To exploit the advantages of isoVCT, the candidate genes are those whose expression level is higher than the average across the total sample and whose isoforms exhibit heterogeneous effect sizes. Finally, we select 6,134 genes.
Software and algorithms
The algorithm is implemented in Microsoft R Open (v3.3.0; Microsoft Corp., Seattle, WA, USA). The functions glmer and glmer.nb in the lme4 package fit the mixed models in our algorithm, and the function ginv in the MASS package is employed to calculate the generalized inverse of a singular matrix. The packages DESeq and edgeR are both from Bioconductor. In consideration of the computing time of the simulations, the doParallel package is used for parallel computation.
The algorithm adapts itself to real data: if the model fit under the NB assumption fails or the dispersion parameter is close to one, the Poisson assumption is used instead. Because it applies the idea of a variance component test, our method is referred to as isoVCT. The distribution of the score test statistic is obtained via theoretical and empirical methods, which we term isoVCT-The and isoVCT-Emp, respectively. The R program for isoVCT and the simulation code are attached as File S2.
Simulations with the NB assumption
The empirical type I error results under the NB assumption are shown in Table 1 and Fig. 1. s, l and p appear to be unrelated to the type I error rate, whereas ϕ appears to be inversely correlated with it. isoVCT-The is noticeably conservative in specific scenarios. The type I error rate of isoVCT-Emp is close to 0.05, suggesting that the permutation approach approximates the distribution of U well in small-sample settings. DESeq controls the type I error; however, the type I error rates of edgeR and TSPM are both dispersed, especially when ϕ = 0.5. Generally, isoVCT controls the type I error rate around the nominal level under the NB assumption.
[Table 1: empirical type I error rates, arranged by M, s and p, for ϕ = 2 and ϕ = 0.5]
The empirical power results under the NB assumption are shown in Tables 2 and 3 and Fig. 2. Three findings emerge: (a) M is positively correlated with power; (b) p and s are negatively correlated with power; and (c) l is also positively correlated with power, and its effect is much sharper than those of p and M. Our results show that isoVCT-Emp is more advantageous than traditional methods, especially when the effect size is small and the effect heterogeneity is large.
[Table 2: empirical power, arranged by l, s and p, for ϕ = 2 and ϕ = 0.5]
[Table 3: empirical power, arranged by l, s and p, for ϕ = 2 and ϕ = 0.5]
Simulations with the Poisson assumption
The results of the type I error and empirical power under the Poisson assumption are shown in File S3.
Real data analysis
The results of the real data analysis are shown in Fig. 3. Only 4,422 genes can be fitted by the GLMM. The three traditional algorithms, DESeq, edgeR and TSPM, define 364, 259 and six DE genes, respectively. The intersection of DESeq and edgeR contains 221 genes, a high proportion of the DE genes that these two algorithms define. TSPM is almost ineffective for these genes.
The results of isoVCT are as follows. First, isoVCT-Emp defines 263 DE genes; of these, 242 are specifically selected by isoVCT-Emp, and only a small portion is also found by the other algorithms. Second, the DE genes identified by isoVCT-Emp include those identified by isoVCT-The. In general, isoVCT is superior at selecting heterogeneous genes.
Here, we propose isoVCT, a variance component score test on the coefficients of covariates for selecting DE genes. Simulations and real data analysis both suggest that this method is advantageous for selecting genes with weak or heterogeneous effects. The simulation results are discussed from two aspects. First, isoVCT controls the type I error in each setting under either distributional assumption. The type I error rate of the empirical distribution is well controlled, whereas that of isoVCT-The is overly conservative, likely because the estimation of the U statistic is inaccurate in small-sample settings. The type I error rates of edgeR and TSPM are inflated in some settings, especially under the NB assumption (Yang et al., 2015). Second, the power of isoVCT is higher than that of the other methods, and the advantage is most evident for small l under the NB assumption. isoVCT-Emp is superior to the other methods in each setting, especially for l = 0.6. The small sample size may explain why the power of isoVCT decreases as p increases. Nevertheless, the power of all five algorithms is high under the Poisson assumption.
In the real data analysis, the Venn diagram of DE genes directly illustrates the relationships among the four algorithms. Because DESeq and edgeR use the same distributional assumption and the same testing idea, their intersection represents a majority of the DE genes that each identifies. TSPM defines the smallest number of DE genes. In contrast, the different idea behind isoVCT leads to a different result: isoVCT-Emp specifically defines 242 genes with heterogeneous or small effect sizes. For example, CASP7 and STAT6 are related to LUSC (Dubey & Saini, 2015; Lee et al., 2009).
Furthermore, isoVCT offers three innovations. First, it derives the empirical and theoretical distributions of the variance component score statistic in the framework of a GLMM and evaluates the performance of the statistic. Second, the random effect α represents the correlation among isoforms within the GLMM framework. Third, isoVCT further verifies the effectiveness of the set test on RNA-Seq data and supplies a new view of biological function using isoform information.
However, isoVCT has some limitations. In the derivation of the score statistic, the random term α is treated as a fixed parameter, which may cause isoVCT-The to be quite conservative. Under the NB assumption, the power of isoVCT is very low in some settings because the estimation of the dispersion parameter may be biased. Weighted-likelihood and quasi-likelihood methods may overcome this drawback in estimating the NB mixed model (Lund et al., 2012).
In conclusion, isoVCT, a supplement to DESeq or edgeR, is powerful and robust for selecting genes with small or heterogeneous effects.