Visitors   Views   Downloads
Note that a Preprint of this article also exists, first published July 18, 2014.


Next generation sequencing (NGS) is an extremely powerful tool for the analysis of mixed populations in ecology and biology, providing a means to assess community composition (metagenomics) whilst also aiding the discovery of new species (Radford et al., 2012). The relatively high error rate of NGS, however, limits the ability to analyze population diversity directly. Population diversity is a crucial ecological measure, relating particularly to selection pressure and (relative) evolutionary fitness. Diversity is critical within biological arenas as well, for instance T cell and viral diversity are key complementary measures of the immune system and infection status respectively. There is thus a pressing need to create tools to appropriately correct NGS error, to estimate diversity introduced by NGS and to estimate inherent virus diversity.

This study was motivated by the prevalence and pathogenicity of honeybee deformed wing virus (DWV) and related viruses. It is known that these may cause both asymptomatic low-level and symptomatic high-level infection in honeybees; the Varroa mite is the likely causal factor that produces a shift from the benign to the pathogenic state, correlating with a shift in the level of viral population diversity (Martin et al., 2012; Ryabov et al., 2014). It was the particular need to test for a change in honeybee viral population diversity that stimulated the current work; for this reason, datasets used here relate to honeybee deformed wing virus populations.

The Varroa mite is now endemic in honeybee colonies throughout Europe, North America and New Zealand, causing significant colony losses (Neumann & Carreck, 2010). Varroa acts as a vector for a range of honeybee viral pathogens, most important of which is the picorna-like deformed wing virus, a close relative of Varroa destructor virus-1 (VDV-1). At high Varroa levels a virus most closely resembling a recombinant between published DWV and VDV-1 sequences is amplified in the colony, leading to overt deformed wing disease and colony loss (Ryabov et al., 2014). Thus, to determine the degree of selection for that strain there is a need to compare viral diversity between Varroa-free and Varroa-infested honeybee viral RNA samples. Viral RNA samples, extracted from individual newly-emerged adult bees from Varroa-free or Varroa-infested colonies, were subjected to next-generation sequencing (NGS) and the resulting reads used to compare the diversity in the distribution of nucleotides at each nucleotide position, from one sample to the other. Next generation sequencing, however, is subject to error which in turn inflates diversity measures. An accurate measure of nucleotide diversity at each position is needed in this context, and thus the statistical structure of the nucleotide distribution can be used to achieve a means for correction.

An extensive literature on the correction of next generation sequencing error is available, entirely devoted to methods for correction of individual nucleotides rather than the distribution of nucleotides at each position. A useful recent review of the existing literature can be found in Yang, Chockalingam & Aluru (2013), where it is pointed out that all methods depend on “alignment and consensus”, with more powerful approaches being developed over time. Methods are classified into three types: k-spectrum based (all k-mers are identified, then clustered by Hamming distance and clusters assumed to come from the same genomic location), suffix tree/array-based (a generalisation of the k-mer approach which accommodates multiple values of k) and methods dependent on a multiple sequence alignment. Early methods (Quince et al., 2009; Zagordi et al., 2011) assumed that errors occur at random while Skums et al. (2012) builds in known platform influences. In Macalalad et al. (2012) the authors utilised the fact that biological variants and process errors exhibit different covariation. Here we develop a simple but effective method for nucleotide distribution correction. This is a problem at a coarser level than rectification of the assignation of each nucleotide, allowing us to use a different approach. Additionally, the method appears to be unique in that it uses a standard sequence for the purposes of error calibration.

A careful examination of the distribution of Shannon diversity first appeared in the ecology literature (Hutcheson, 1970). Beerenwinkel et al. (2012) provides an excellent overview of the state-of-the-art in estimating viral diversity from next-generation sequencing data; diversity can be measured at single sites (for single nucleotide variant detection), locally in windows of a multiple sequence alignment and globally over the entire genomic region. In Gregori et al. (2014) the authors look at three measures of diversity and compare their values (the variance of diversity is incorrectly stated in Gregori et al. (2014), but corrected here in the Diversity analysis section). To conclude the review of literature on a broader note, in Schreiber & Brown (2002) a method for correcting systematic bias in the nucleotide distribution (the “distortion”) in a genome is presented.

The novel stochastic method for NGS error correction of diversity measures introduced here is based on use of an evolutionary model (in the first instance, the Jukes–Cantor model (Jukes & Cantor, 1969)) and requires the availability of an accurately known sequenced sample. The corrected nucleotide distribution is used to calculate both the expected diversity and the diversity variance. These in turn can be used to test hypotheses about single diversities, or to test the hypothesis of equality of diversity across two samples. Hypotheses about mean diversities and comparisons of mean diversities of two samples are also discussed, taking into account correlation of diversities across the averaged sites. Finally, the ideas are used to produce a diversity threshold for a sample to be consistent with a clonal (zero nucleotide diversity) population. The methodology is illustrated with NGS datasets drawn from the honeybee study. The methods discussed extend our understanding of NGS error correction and provide a core set of tools useful in the study of diversity.

Materials and Methods

Error and diversity correction

We first correct the NGS error in the nucleotide distribution and then use this distribution to calculate the (corrected) diversity. Our starting point is the true (to be estimated) nucleotide distribution, p. Assuming independence of errors, the error introduced through NGS can be modelled by a 4 × 4 matrix M whose entries give the probability of a particular nucleotide change during sequencing, conditional on the initial value of the nucleotide, whence the theoretical distribution of nucleotides following sequencing is q = Mp. In practice we sequence finitely many nucleotides n so the empirical distribution is qˆ with Eqˆ=q. The simplest error model is the Jukes–Cantor model, where a nucleotide mutates to any other with equal (low) probability α, summarised as M=Mα=PA|APA|CPA|GPA|TPC|APC|CPC|GPC|TPG|APG|CPG|GPG|TPT|APT|CPT|GPT|T=13ααααα13ααααα13ααααα13α. Since Mp = p + α(1−4p), this transformation has the effect of moving all probabilities closer to 0.25, thereby increasing diversity. Richer mutation models can be incorporated, for example allowing different mutation rates between purines and pyrimidines; for this study the Jukes–Cantor model is sufficient for our purposes.

The NGS error can be partly corrected by reversing the mutation process, first multiplying qˆ by M−1. In practice, thanks to the sampling variation introduced during NGS, the reversal can yield negative components. For this reason we must bring these back to zero and then normalize so that the nucleotide proportions sum to one. Thus, the corrected nucleotide distribution rˆ is the component-wise maximum of M1qˆ and (0, 0, 0, 0), denoted M1qˆ+, normalised, so rˆ=M1qˆ+/i=14M1qˆ+i. Since M1qˆ=qα/14α, this transformation has the effect of moving all probabilities away from 0.25, so decreasing diversity. Only when qˆ=Mp does this reversal process work exactly, returning p. Figure 1A lays out these stages, from p, the true nucleotide distribution in the organism, to qˆ the distribution following NGS, to rˆ, the corrected distribution. To illustrate this numerically, with p = (1, 0, 0, 0), α = 0.001 and n = 10,000 then (n1, n2, n3, n4) could be (9989, 4, 3, 4) (using a multinomial distribution) whence qˆ=0.9989,0.0004,0.0003,0.0004, M1qˆ=1.0019,0.0006,0.0007,0.0006, M1qˆ+=1.0019,0,0,0, and rˆ=1,0,0,0=p. In this example the correction is exact, but this is not always the case, whence the estimate rˆ will still be subject to error.

The stages of NGS nucleotide distribution error correction.

Figure 1: The stages of NGS nucleotide distribution error correction.

(A) Error is introduced to the true nucleotide distribution p, giving distribution qˆ, the result of multinomial sampling with n (the coverage, the number of nucleotides sequenced) trials in which the true sequenced nucleotide distribution is given by q = Mp. The error is corrected by forming rˆ, the normalisation of M1qˆ+. In practice M must be estimated (the calibration step), using a known initial distribution p. (B) The initial diversity H(p) increases to Hqˆ under NGS, but falls back to Hrˆ when corrected.

To utilize this correction method, we need the matrix M; in the case of the Jukes–Cantor model this requires an estimate of the mutation rate α. This can be achieved by estimating it on a sample of known composition and genotype, denoted here the “standard” sample. This provides a two-step method to correct an empirical nucleotide distribution qˆ for NGS error. The first (calibration) step uses p standard and qˆ standard to estimate α (as αˆ, so giving an estimate Mˆ of M) while the second (correction) step uses Mˆ and qˆ to estimate p with rˆ, as follows:

  1. Calibration Step. This is the estimation of α, the error rate in NGS sequencing. We begin with a sample in which the nucleotide mix p standard = (p1,p2,p3,p4)T is known at each nucleotide position along a genome. We then sequence the mix using NGS, giving nucleotide mix qˆ standard=qˆ1,qˆ2,qˆ3,qˆ4T. For each site we can solve for αˆ using qˆ=Mˆp. Equating ith components provides estimates αˆi=qˆipi/14pi for i = 1, …, 4; these can be averaged across nucleotides to give a lower variance estimate of α, αˆ=αˆ1+αˆ2+αˆ3+αˆ44=14i=14qˆipi14pi. These estimates can be averaged across nucleotide positions to improve the estimate further.

  2. Correction Step. Given nucleotide distributions qˆ produced from NGS of a sample, form the corrected counts rˆ=Mˆ1qˆ+/i=14Mˆ1qˆ+i, where Mˆ=Mαˆ.

Example 1 in the Results section illustrates these calibration and correction steps with a numerical example.

It is possible to estimate an upper bound for the residual error rate remaining after correction, given an NGS error rate α and coverage n. This is done by using the calibration step with p standard = (1, 0, 0, 0) and qˆ standard=rˆ, denoting the estimated residual error rate by βˆ. Here rˆ is found by first generating n values using q = M(α) p, then correcting using rate α. A graph of βˆ against α is given in Fig. 2, for the case p standard = (1, 0, 0, 0) and n = 1,000, 2,000, 5,000, 10,000 and 20,000. This demonstrates, for these parameters, that correction reduces the error by a factor of over 10 for a coverage of n = 1,000. That this is an upper bound is made clear in the later Discussion section.

The estimated residual error rate $\hat {\beta }$βˆ after correction plotted against the initial error rate α, when the true initial distribution is the degenerate p = (1, 0, 0, 0).

Figure 2: The estimated residual error rate βˆ after correction plotted against the initial error rate α, when the true initial distribution is the degenerate p = (1, 0, 0, 0).

The larger the coverage n, the more accurately qˆ estimates q and so the more dependable the correction, hence the smaller is β. Curves are shown for coverage n of 1,000 (blue), 2,000 (green), 5,000 (red), 10,000 (cyan) and 20,000 (magenta). Each point on the graph is the mean of N = 5,000 replicate trials.

Given a nucleotide probability mass distribution p = (p1, p2, p3, p4) the nucleotide diversity is H=i=14pilog4pi. This can be thought of as the geometric mean ipipi of the distribution probability masses, transformed by taking the natural logarithm of the reciprocal; then no diversity, for example (1,0,0,0), has H = 0 while complete diversity, a uniform distribution of (0.25, 0.25, 0.25, 0.25), has H = 1. Figure 3 shows the graph of a component −xlog4x of diversity. The steep slope at the extremes, particularly at zero, shows that correction to a component at these extremes has most effect on diversity measurement. Of interest here is detection of clonality, where H = 0, so correction of very small probabilities is critical.

The graph of a component of diversity, −xlog4x.

Figure 3: The graph of a component of diversity, −xlog4x.

The gradient converges to infinity as x goes to zero. A consequence is that corrections in this region have a large influence on diversity measurement.

With the corrected nucleotide distribution available, the Shannon diversity can be estimated as Hˆ=Hrˆ=i=14riˆlog4riˆ. Figure 1B schematically shows the relative levels of H(p) (true diversity), Hqˆ (uncorrected diversity) and Hrˆ (corrected diversity). We conclude this section by remarking that the data used for calibration should ideally be sequenced under similar conditions to the data which is to be corrected.

Diversity analysis

Two questions are key in an analysis of population diversity. Firstly, whether the diversity H is equal to a given value ℋ and secondly, whether two populations have the same diversity, or alternatively whether one is more diverse than the other. Specifically, we want to test null hypotheses, firstly of the form H0: H = ℋ and secondly of the form H0: H1 = H2, for two samples labeled ‘1’ and ‘2’. We will examine both of these tests for diversity measured at a single nucleotide position and for the mean diversity across segments comprising multiple (N) possibly correlated sites.

In all cases we will have one or more true nucleotide distributions p, true post NGS nucleotide distributions q = Mp, estimates qˆ determined using a sample of n nucleotides drawn from q, so qiˆ=ni/n for each i = 1, …, 4, where ni is the number of draws resulting in the ith nucleotide and finally, corrected distributions rˆ. Given rˆ, Hˆ=Hrˆ, as in (2), estimates the true diversity H=ipilog4pi with H evaluated at corrected values rˆ of estimates qˆ of true values after error introduction, q. When n is large, this approximation will be close to the true value. In general, since the diversity function is concave, diversity is increased when we replace p with q and decreased when we replace q with qˆ (since Hˆ is known to underestimate H (Hutcheson, 1970)) and then rˆ (since rˆ corrects qˆ). This compensation is fortuitous.

The variance of Shannon diversity VarHˆ at a nucleotide position with coverage n and empirical distribution rˆ=rˆ1,rˆ2,rˆ3,rˆ4 is approximated to first order (Hutcheson, 1970) with the following expression, VarHˆ1ni=14rˆilog4rˆi2i=14rˆilog4rˆi2. This can be shown using the Δ-method and that Varrˆi=ri1ri/n and Covrˆirˆj=rirj/n, where Erˆi=ri for i = 1, …, 4.

For testing and estimation we also need the variance of the mean of diversities averaged across a segment of sites. For ease of presentation (though it is readily generalised), we now assume that (3) provides the variance of estimated diversity at each nucleotide position, denoted s2. Empirical evidence shows that diversity is correlated across positions. We assume an exponential decay of the diversity correlation, with correlation between nucleotide positions i and j given by ρji+1. Then the average diversity across N consecutive nucleotide positions, denoted H̄, has variance VarH̄ given by VarH̄s2N1+2N1Nρ1ρ2N2ρ1ρ21ρN1 =s2N1+2ρ1ρfor N large. We estimate ρ using the correlation between adjacent diversities (the one-step lagged correlation) along the segment of interest. With these variance estimates we can conduct hypothesis tests and construct confidence intervals in all cases considered. These are summarised in Table 1 and illustrated numerically in Example 2 in “Results”.

Table 1:
Test statistics and two-sided 95% confidence intervals, in brackets, for diversity testing and estimation with a single position and one or two samples (columns).
A population mean is denoted by μ. For the corresponding formulae involving means across multiple positions, H̄ replaces Hˆ.
Number of samples
1 2
Hypothesis Test statistic z Hypothesis Test statistic z
μH = ℋ H ˆ Var H ˆ μH1μH2 = 0 H ˆ 1 H ˆ 2 Var H ˆ 1 + Var H ˆ 2
[Hˆ±z0.975VarHˆ H1ˆH2ˆ±z0.975VarHˆ1+VarHˆ2]
DOI: 10.7717/peerj.645/table-1

Clonal threshold estimation

Suppose that a clonal sample (one with zero diversity) is sequenced by next generation methods with error rate α. This is corrected to level βˆ, as shown in Fig. 2. This level of error remains in the sample, whence an upper 95% threshold for H̄, the average diversity across N sequential positions, with lagged correlation ρ, is H̄+z0.95VarH̄ where z0.95 = 1.64 is the 95th percentile of the standard normal distribution. Note that this threshold depends on both the coverage n and the segment length N. A larger n typically decreases H̄ and so the threshold level, while a larger N decreases the width of the confidence interval around H̄, so also decreases the threshold level.

Sequence data

Three viral samples were sequenced by next generation sequencing (Illumina HiSeq1000). These comprised a sample that could be used for estimation of the error (standard sample), and two experimental samples, from low viral load and high viral load bees. After trimming adapters and barcodes, the first reads (the reads are paired), 101nts long, were used to calculate the NGS error rate and illustrate the methods.

Standard sample: an accurately known mix of two viral recombinants, X59

This was composed of two of the DWV-like viral recombinant RNA genomes described in Moore et al. (2011), VDV-1DV D and VDV-1V V D (NCBI Accession Nos. HM067437 and HM067438 respectively). The mixture was produced by in vitro RNA transcription using linearized plasmid clones with full-length cDNA inserts of VDV-1DV D and VDV-1V V D (Moore et al., 2011) with the T7 mMESSAGEmMACHINE kit (Ambion). The RNA transcripts were purified using RNAeasy columns (Quiagen), quantified, mixed 25% VDV-1DV D, 75% VDV-1V V D and sequenced using the Illumina platform protocol. The NGS reads are available in the EBI Sequence Read Archive, study accession PRJEB5249, ERS395188. The reads were aligned using Bowtie 2 (Langmead & Salzberg, 2012) to a single reference sequence (VDV-1DV D). SAMtools mpileup (Li et al., 2009) was used to produce the number of nucleotides covering each position in the reference. We excluded positions of mismatches between the VDV-1DV D and VDV-1V V D sequences, about 4% of the genome length. This produced nucleotide pileups where all diversity was a result of the methods used during NGS sequence acquisition. This dataset is used for calibration of NGS error.

A viral population mix from low DWV level (F3) and from high DWV level (E7) honeybees

Samples from honeybee nurse bees for which qRT-PCR showed either low DWV levels (EBI SRA study accession PRJEB5249, ERS395182) or high DWV levels (EBI SRA study accession PRJEB5249, ERS395180) were used. In each case, in order to assemble the required nucleotide distribution data, the following computational steps were carried out. “First reads” of length 101nt were aligned using Bowtie 2 to DWV (NC_004830) and VDV-1 (NC_006494) reference sequences. SAMtools mpileup was then used to produce the number of nucleotides covering each position in the references.

Deformed wing virus

A map of the deformed wing virus adapted from Lanzi et al. (2006) is shown in Fig. 4. In the Results section we select a single site in the helicase region of the genome (Examples 2.1 and 2.2), and average across the capsid region (Examples 2.3 and 2.4).

Map of the genome of the deformed wing virus, a positive-stranded RNA virus.

Figure 4: Map of the genome of the deformed wing virus, a positive-stranded RNA virus.

The capsid region codes for the four structural proteins, multiple copies of which are used to build the icosahedral virus shell, while the helicase region is involved in the unfolding of the RNA strand. Untranslated regions (UTRs) flank the coding sequence.


Example 1. Calibration and error correction (i) Calibration—estimation of the NGS error rate α. Nucleotide pileups resulting from the X59 mix of two viral recombinants were used for calibration of the NGS error rate. Nucleotide positions where DWV and VDV-1 are identical were selected; for these the true distribution is one of p = (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0) or (0, 0, 0, 1). For example, the upper half of Table 2 shows the pileup counts for the first three capsid (see Fig. 4) nucleotide sites. Since the p here place all weight on one nucleotide, Eq. (1) reduces, when p = (1, 0, 0, 0), for example, to qˆ2+qˆ3+qˆ4/3 or the average error proportion. Thus, for the first position in Table 2 we would estimate α as (0.0010 + 0.0016 + 0.0008)/3 or 0.0011; for the second and third positions the corresponding averages are 0.0014 and 0.0007, giving a local average of 0.0011. This was done for the entire capsid region of the genome (2,843nts in length), yielding αˆ=0.001949, so the overall NGS error rate is estimated at a little under 2 in 1,000.

(ii) Correction of the distributions qˆ. This was carried out in three ways.

  1. Within the capsid region of X59, using αˆ=0.001949 calculated using the entire capsid region. Correction of the distribution qˆ in the first three capsid positions, shown in the upper half of Table 2, yielded the rˆ distributions shown; each is corrected to the exact distribution. (The Excel software available was used to also test the calibration-correction routine across a central region in the helicase and a non-structural region downstream of the helicase; in both cases the degenerate distributions were found with excellent accuracy.)

  2. Within the capsid region of E7, again using αˆ=0.001949 calculated using the entire capsid region of the independent standard sample X59. The accurate results are shown in the lower half of Table 2.

  3. Within the capsid region of X59, using cross-validation. The capsid region was divided into sites with odd index (for calibration) and sites with even index (for correction). The calibration calculation here gives αˆ=0.001582. Two arbitrarily chosen even indices in the correction region (where the true distribution is known to be degenerate) had associated raw nucleotide counts {ni} of (71, 91461, 39, 36) (so qˆ=0.0008,0.9984,0.0004,0.0004) and (134, 60, 90, 65615) (so qˆ=0.0020,0.0009,0.0014,0.9957). The corrected distributions rˆ are (0, 1, 0, 0) and (0.0005, 0, 0, 0.9995) respectively, showing considerable improvement.

Table 2:
Correction of NGS errors.
The raw nucleotide counts, the relative frequency distribution qˆ of nucleotides after next generation sequencing and the relative frequency distribution after correction rˆ are shown for the first three X59 dataset capsid positions (upper half of table); here calibration and correction are both based on X59. For each position the correct distribution of bases was recovered. The corresponding data and corrections for dataset E7 (a dataset independent of the calibration dataset X59) are shown in the lower half of the table, the method again yielding complete correction.
Position no. A C G T Diversity
1 Counts {ni} 87,573 86 142 73
Proportion qˆ 0.9966 0.0010 0.0016 0.0008 0.0222
Proportion rˆ 1 0 0 0 0
2 Counts {ni} 90 130 156 86,961
Proportion qˆ 0.0010 0.0015 0.0018 0.9957 0.0270
Proportion rˆ 0 0 0 1 0
3 Counts {ni} 68 29 87,998 93
Proportion qˆ 0.0008 0.0003 0.9978 0.0010 0.0149
Proportion rˆ 0 0 1 0 0
1 Counts {ni} 13,267 11 16 11
Proportion qˆ 0.9971 0.0008 0.0012 0.0008 0.0161
Proportion rˆ 1 0 0 0 0
2 Counts {ni} 11 25 15 13,215
Proportion qˆ 0.0008 0.0019 0.0011 0.9962 0.0208
Proportion rˆ 0 0 0 1 0
3 Counts {ni} 8 9 13,440 10
Proportion qˆ 0.0006 0.0007 0.9980 0.0007 0.0120
Proportion rˆ 0 0 1 0 0
DOI: 10.7717/peerj.645/table-2

Example 2: Diversity testing and confidence intervals

2.1 Single site, single population

The nucleotide distribution, in A, C, G, T order, at arbitrarily chosen position 5201 (within the helicase region of the viral genome, Fig. 4) in F3 is (1, 0, 58, 1) giving qˆ=0.0167,0,0.9667,0.0167. Correction using estimate αˆ=0.001949 yields corrected proportions rˆ=(0.0148, 0, 0.9704, 0.0148). True diversity is then estimated as Hrˆ=0.1110 using (2) and VarHˆ=0.0044 using (3), whence the test statistic z is 1.6814 with p-value of 0.0463, leading to rejection of the null hypothesis that the true diversity ℋ = 0 at the 5% level. A 95% confidence interval for true diversity is [0, 0.2404].

Sample E7 at nucleotide position 5201 has counts of (38, 35, 26497, 48) giving qˆ=0.0014,0.0013,0.9954,0.0018 and rˆ=0,0,1,0. True diversity is estimated as Hˆ=0 and VarHˆ=0 (since (3) converges to zero in this case), whence the hypothesis of zero diversity is not rejected. A 95% confidence interval for true diversity is {0}.

2.2 Single site, two populations

We test whether the diversities are equal at nucleotide position 5201 for F3 (first sample) and E7 (second sample), using the data in Example 2.1. The test statistic z described in Table 1 is 1.6814 with p-value = 0.0463. Hence we reject the null hypothesis of equal true diversities for the two populations at this site, concluding that F3 has higher diversity. A 95% confidence interval for the true diversity difference is [−0.0184, 0.2404].

2.3 Multiple sites, single population

We test whether the mean diversity across the capsid region (see Fig. 4) of the viral genome is zero, for both E7 and F3 (N = 2, 843). For E7, H̄=0.0037, the serial correlation of diversities is estimated as ρˆ=0.1728 and mean site diversity variance is s2 = 7.714 × 10−7 whence VarH̄=3.847×1010 using (5), giving z = 187.5 so that the p-value is negligible. Here the sample size is so large and hence the power so great that we conclude that the true diversity is non-zero. Of greater interest is a 95% confidence interval for the true diversity, namely [0.0036, 0.0037]. Corresponding calculations for the more varied F3 sample give values of H̄=0.0410, ρˆ=0.1195, s2 = 0.0019, VarH̄=8.702×107, z = 43.98 and again a negligible p-value. A 95% confidence interval for the true mean diversity value is [0.0392, 0.0429].

2.4 Multiple sites, two populations

We test the null hypothesis that the true mean diversities, over the capsid region, of F3 (first sample) and E7 (second sample) are equal. Using values already given in Example 2.3, we find that z = 40.03, again giving a negligible p-value. Hence we conclude that the mean diversity of F3 is greater than that of E7. A 95% confidence interval for the true mean diversity difference is [0.0355, 0.0392].

Example 3: Clonal threshold

Using αˆ=0.001949 and with coverage of n = 20,000 we find that βˆ=0.0001233. This gives Hβˆ=0.002748, by direct calculation using rˆ=log413βˆ13βˆ3log4βˆβˆ. Then s=VarHβˆ=0.0008940 using (3). The 95% clonal diversity threshold for a single site is therefore 0.002748 + 1.64 × 0.0008940 or approximately 0.0042. For mean diversity, with N = 2, 843 and ρˆ=0.1728 (appropriate for E7 on the capsid) the 95% clonal mean diversity threshold is Hβˆ+1.64s/N12ρˆ/1ρˆ=0.0028.

Figure 5 illustrates the confidence intervals found in Example 2.3, labelled “After correction”, together with corresponding confidence intervals for the uncorrected data, labelled “Before correction”. The corrected clonal mean diversity threshold of Example 3 is also shown. The highly significant difference between the Varroa-free and Varroa-infested nucleotide diversities are evident. Correction of the low diversity sample has a far greater effect on diversity than correction of the high diversity sample, due to the steep slope of the diversity component near zero (Fig. 3). The corrected low diversity sample lies just above the clonal threshold.

Confidence intervals for true mean diversity across the capsid region, for the high diversity (F3) and low diversity (E7) samples, both before and after correction for the NGS error.

Figure 5: Confidence intervals for true mean diversity across the capsid region, for the high diversity (F3) and low diversity (E7) samples, both before and after correction for the NGS error.

Correction has a larger effect when the (uncorrected) diversity is low, clearly revealing the reduction in diversity from F3 (low DWV level) to E7 (high DWV level). The clonal threshold for the mean shows that the corrected E7 data plausibly has non-zero diversity.


Here we have proposed a simple method for correction of nucleotide distributions containing errors arising from NGS. The correction method assumes an independent site error model. Throughout we have used the simplest error matrix M, the Jukes–Cantor matrix, with the error rate assumed to be constant across the genome and mutation parameter fitted from a standard (known genotype) sample. The method could be extended to use a richer evolutionary model, with parameters tailored to regions of the genome. For instance, the two-parameter Kimura model assumes transversion and transition errors differ, while the Generalized Kimura model adds a further parameter, allowing transversion error probabilities to differ according to transversion direction (Ewens & Grant, 2005). At an extreme, a full 12-parameter model could be used, fitted in sets of three consecutive nucleotides, since four linear equations are available per position. In principle, both the statistical nature of the error mechanism of the NGS platform and errors introduced in sample generation (e.g., PCR) should be captured in the model used.

A geometric view of the correction process is shown in Fig. 6. The tetrahedron is the locus of all probability mass functions on the four points {A, C, G, T}, with the vertices corresponding to situations where the probability mass is on a single nucleotide and the diversity is zero. The centroid c = (0.25, 0.25, 0.25, 0.25) corresponds to the situation where mass is spread equally across nucleotides, and the diversity takes its maximum value of one. When the true nucleotide distribution is p (e.g., p = (1, 0, 0, 0) in Fig. 6) and the NGS error rate is α, the limiting distribution following NGS (based on sequencing a very large number of nucleotides) is q = Mp = (1−4α) p + 4αc, along the line between p and c. In practice we sequence finitely many nucleotides n, giving an empirical distribution qˆ close to q. The correction of qˆ, rˆ=qˆ4αc/14α, reverses the p to q contraction with rˆ lying along the extension of the line from c to qˆ. (If this rˆ lies outside the simplex, rˆ is taken to be the nearest point in the simplex to its projection onto the plane determined by its non-negative components.) In general, the further p is from c, the greater the distance from p to q, so the greater the distance from qˆ to rˆ; the greater the error, the greater the correction. In turn, diversity increases from zero to one along the line from a tetrahedron vertex to the centroid, and is convex down, as shown in Fig. 7, so changes more quickly when the diversity is low and is relatively stable when diversity is high. In summary,

  1. The method corrects more when more correction is called for, and

  2. The method corrects more where the benefit of correction is greater.

The upshot is that bias inherent in the uncorrected diversity values is rectified by the correction process.

A geometric view of the correction process, showing the tetrahedron of all nucleotide distributions.

Figure 6: A geometric view of the correction process, showing the tetrahedron of all nucleotide distributions.

When the true nucleotide distribution p is (1, 0, 0, 0), the limiting distribution after next generation sequencing q is along the line from p to the centroid c. Under sampling, we see nearby qˆ which is corrected to rˆ, lying on the extension of the line from c to qˆ.
The diversity profile, from vertex to centroid.

Figure 7: The diversity profile, from vertex to centroid.

The increase is greatest at the vertex and least at the centroid.

We caution that error remains, even after correction. In the light of the geometry, worst-case remaining error occurs for p a vertex. In this case Fig. 2 provides information about the worst-case error rate βˆ remaining after correction. This decreases as α decreases (since lower α introduces lower error) and decreases as the coverage n increases (since as n increases, qˆ becomes closer to q).

The diversity testing and estimation stage uses a diversity correlation structure. We have assumed serial correlation of diversities, since this was seen in the samples considered. Other correlation structures could be employed, such as a structure that decays more rapidly with separation.

The method is able to correct in multi-genome situations which appear to lie beyond the reach of other methods. All correction methods in the literature known to the authors are variants of alignment of reads to a reference genome and the taking of a consensus. If the original sample is a mix of two genomes then such methods are unable to correct at sites of variation (since the minor variant will generally be corrected to the major variant), whereas the proposed method estimates the error rate (using a calibration sample) and so is able to correct, even at sites of variation.

Finally, it may be possible to use the “clonal threshold” to detect SNPs or the more general SNVs. This is a topic for future research.


Movement of a virus to a new host can trigger changes in the level of diversity of variants of the virus. Such diversity changes can be detected using next generation sequencing of the viral mix, but error introduced in the process may mask a diversity change. We have presented a method for estimating measurement-error-corrected diversity from NGS data, both at single nucleotide sites and along sections of a genome. A simple but effective method is used to correct for NGS errors in the nucleotide distribution. With these corrected estimates we are able to compare diversity between samples, assessing whether they are consistent with equal diversity, and also assess whether a diversity estimate is consistent with a clonal population. We demonstrated our method within the context of viral diversity in bees that either have low viral load or high viral load, showing that their viral diversity is significantly different and that under high viral load the viral population is near-clonal. Correction for NGS error was particularly important when applied to the sample with low diversity, since the diversity estimate decreased by a factor of more than five. This analysis has demonstrated that the Varroa mite has a significant effect on the DWV population in honeybees, due to the emergence of a virulent strain that is preferentially replicated after transmission by Varroa. This results in the striking reduction in viral diversity observed in Varroa-exposed pupae and symptomatic worker bees. Our method is applicable to any population structure analysis using NGS data and thus adds a valuable tool to the study of selection pressure and differential fitness within populations.