Error correction and diversity analysis of population mixtures determined by NGS
 Published
 Accepted
 Received
 Academic Editor
 Abhishek Kumar
 Subject Areas
 Biodiversity, Bioinformatics, Computational Biology, Genetics, Mathematical Biology
 Keywords
 Calibration, Error correction, Honeybee, Metagenome, Nucleotide diversity, Standard sample, Viral mix
 Copyright
 © 2014 Wood et al.
 Licence
 This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
 Cite this article
 2014. Error correction and diversity analysis of population mixtures determined by NGS. PeerJ 2:e645 https://doi.org/10.7717/peerj.645
Abstract
The impetus for this work was the need to analyse nucleotide diversity in a viral mix taken from honeybees. The paper has two findings. First, a method for correction of next generation sequencing error in the distribution of nucleotides at a site is developed. Second, a package of methods for assessment of nucleotide diversity is assembled. The error correction method is statistically based and works at the level of the nucleotide distribution rather than the level of individual nucleotides. The method relies on an error model and a sample of known viral genotypes that is used for model calibration. A compendium of existing and new diversity analysis tools is also presented, allowing hypotheses about diversity and mean diversity to be tested and associated confidence intervals to be calculated. The methods are illustrated using honeybee viral samples. Software in both Excel and Matlab and a guide are available at http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/, the Warwick University Systems Biology Centre software download site.
Introduction
Next generation sequencing (NGS) is an extremely powerful tool for the analysis of mixed populations in ecology and biology, providing a means to assess community composition (metagenomics) whilst also aiding the discovery of new species (Radford et al., 2012). The relatively high error rate of NGS, however, limits the ability to analyze population diversity directly. Population diversity is a crucial ecological measure, relating particularly to selection pressure and (relative) evolutionary fitness. Diversity is critical within biological arenas as well, for instance T cell and viral diversity are key complementary measures of the immune system and infection status respectively. There is thus a pressing need to create tools to appropriately correct NGS error, to estimate diversity introduced by NGS and to estimate inherent virus diversity.
This study was motivated by the prevalence and pathogenicity of honeybee deformed wing virus (DWV) and related viruses. It is known that these may cause both asymptomatic lowlevel and symptomatic highlevel infection in honeybees; the Varroa mite is the likely causal factor that produces a shift from the benign to the pathogenic state, correlating with a shift in the level of viral population diversity (Martin et al., 2012; Ryabov et al., 2014). It was the particular need to test for a change in honeybee viral population diversity that stimulated the current work; for this reason, datasets used here relate to honeybee deformed wing virus populations.
The Varroa mite is now endemic in honeybee colonies throughout Europe, North America and New Zealand, causing significant colony losses (Neumann & Carreck, 2010). Varroa acts as a vector for a range of honeybee viral pathogens, most important of which is the picornalike deformed wing virus, a close relative of Varroa destructor virus1 (VDV1). At high Varroa levels a virus most closely resembling a recombinant between published DWV and VDV1 sequences is amplified in the colony, leading to overt deformed wing disease and colony loss (Ryabov et al., 2014). Thus, to determine the degree of selection for that strain there is a need to compare viral diversity between Varroafree and Varroainfested honeybee viral RNA samples. Viral RNA samples, extracted from individual newlyemerged adult bees from Varroafree or Varroainfested colonies, were subjected to nextgeneration sequencing (NGS) and the resulting reads used to compare the diversity in the distribution of nucleotides at each nucleotide position, from one sample to the other. Next generation sequencing, however, is subject to error which in turn inflates diversity measures. An accurate measure of nucleotide diversity at each position is needed in this context, and thus the statistical structure of the nucleotide distribution can be used to achieve a means for correction.
An extensive literature on the correction of next generation sequencing error is available, entirely devoted to methods for correction of individual nucleotides rather than the distribution of nucleotides at each position. A useful recent review of the existing literature can be found in Yang, Chockalingam & Aluru (2013), where it is pointed out that all methods depend on “alignment and consensus”, with more powerful approaches being developed over time. Methods are classified into three types: kspectrum based (all kmers are identified, then clustered by Hamming distance and clusters assumed to come from the same genomic location), suffix tree/arraybased (a generalisation of the kmer approach which accommodates multiple values of k) and methods dependent on a multiple sequence alignment. Early methods (Quince et al., 2009; Zagordi et al., 2011) assumed that errors occur at random while Skums et al. (2012) builds in known platform influences. In Macalalad et al. (2012) the authors utilised the fact that biological variants and process errors exhibit different covariation. Here we develop a simple but effective method for nucleotide distribution correction. This is a problem at a coarser level than rectification of the assignation of each nucleotide, allowing us to use a different approach. Additionally, the method appears to be unique in that it uses a standard sequence for the purposes of error calibration.
A careful examination of the distribution of Shannon diversity first appeared in the ecology literature (Hutcheson, 1970). Beerenwinkel et al. (2012) provides an excellent overview of the stateoftheart in estimating viral diversity from nextgeneration sequencing data; diversity can be measured at single sites (for single nucleotide variant detection), locally in windows of a multiple sequence alignment and globally over the entire genomic region. In Gregori et al. (2014) the authors look at three measures of diversity and compare their values (the variance of diversity is incorrectly stated in Gregori et al. (2014), but corrected here in the Diversity analysis section). To conclude the review of literature on a broader note, in Schreiber & Brown (2002) a method for correcting systematic bias in the nucleotide distribution (the “distortion”) in a genome is presented.
The novel stochastic method for NGS error correction of diversity measures introduced here is based on use of an evolutionary model (in the first instance, the Jukes–Cantor model (Jukes & Cantor, 1969)) and requires the availability of an accurately known sequenced sample. The corrected nucleotide distribution is used to calculate both the expected diversity and the diversity variance. These in turn can be used to test hypotheses about single diversities, or to test the hypothesis of equality of diversity across two samples. Hypotheses about mean diversities and comparisons of mean diversities of two samples are also discussed, taking into account correlation of diversities across the averaged sites. Finally, the ideas are used to produce a diversity threshold for a sample to be consistent with a clonal (zero nucleotide diversity) population. The methodology is illustrated with NGS datasets drawn from the honeybee study. The methods discussed extend our understanding of NGS error correction and provide a core set of tools useful in the study of diversity.
Materials and Methods
Error and diversity correction
We first correct the NGS error in the nucleotide distribution and then use this distribution to calculate the (corrected) diversity. Our starting point is the true (to be estimated) nucleotide distribution, p. Assuming independence of errors, the error introduced through NGS can be modelled by a 4 × 4 matrix M whose entries give the probability of a particular nucleotide change during sequencing, conditional on the initial value of the nucleotide, whence the theoretical distribution of nucleotides following sequencing is q = Mp. In practice we sequence finitely many nucleotides n so the empirical distribution is $\stackrel{\u02c6}{q}$ with $E\left(\stackrel{\u02c6}{q}\right)=q$. The simplest error model is the Jukes–Cantor model, where a nucleotide mutates to any other with equal (low) probability α, summarised as $M\phantom{\rule{0ex}{0ex}}=\phantom{\rule{0ex}{0ex}}M\left(\alpha \right)=\left[\begin{array}{cccc}\hfill P\left(AA\right)\hfill & \hfill P\left(AC\right)\hfill & \hfill P\left(AG\right)\hfill & \hfill P\left(AT\right)\hfill \\ \hfill P\left(CA\right)\hfill & \hfill P\left(CC\right)\hfill & \hfill P\left(CG\right)\hfill & \hfill P\left(CT\right)\hfill \\ \hfill P\left(GA\right)\hfill & \hfill P\left(GC\right)\hfill & \hfill P\left(GG\right)\hfill & \hfill P\left(GT\right)\hfill \\ \hfill P\left(TA\right)\hfill & \hfill P\left(TC\right)\hfill & \hfill P\left(TG\right)\hfill & \hfill P\left(TT\right)\hfill \end{array}\right]=\left[\begin{array}{cccc}\hfill 13\alpha \hfill & \hfill \alpha \hfill & \hfill \alpha \hfill & \hfill \alpha \hfill \\ \hfill \alpha \hfill & \hfill 13\alpha \hfill & \hfill \alpha \hfill & \hfill \alpha \hfill \\ \hfill \alpha \hfill & \hfill \alpha \hfill & \hfill 13\alpha \hfill & \hfill \alpha \hfill \\ \hfill \alpha \hfill & \hfill \alpha \hfill & \hfill \alpha \hfill & \hfill 13\alpha \hfill \end{array}\right].$ Since Mp = p + α(1−4p), this transformation has the effect of moving all probabilities closer to 0.25, thereby increasing diversity. Richer mutation models can be incorporated, for example allowing different mutation rates between purines and pyrimidines; for this study the Jukes–Cantor model is sufficient for our purposes.
The NGS error can be partly corrected by reversing the mutation process, first multiplying $\stackrel{\u02c6}{q}$ by M^{−1}. In practice, thanks to the sampling variation introduced during NGS, the reversal can yield negative components. For this reason we must bring these back to zero and then normalize so that the nucleotide proportions sum to one. Thus, the corrected nucleotide distribution $\stackrel{\u02c6}{r}$ is the componentwise maximum of ${M}^{1}\left(\stackrel{\u02c6}{q}\right)$ and (0, 0, 0, 0), denoted ${\left({M}^{1}\stackrel{\u02c6}{q}\right)}_{+}$, normalised, so $\stackrel{\u02c6}{r}={\left({M}^{1}\stackrel{\u02c6}{q}\right)}_{+}/\sum _{i=1}^{4}{\left({\left({M}^{1}\stackrel{\u02c6}{q}\right)}_{+}\right)}_{i}$. Since ${M}^{1}\left(\stackrel{\u02c6}{q}\right)=\left(q\alpha \right)/\left(14\alpha \right)$, this transformation has the effect of moving all probabilities away from 0.25, so decreasing diversity. Only when $\stackrel{\u02c6}{q}=Mp$ does this reversal process work exactly, returning p. Figure 1A lays out these stages, from p, the true nucleotide distribution in the organism, to $\stackrel{\u02c6}{q}$ the distribution following NGS, to $\stackrel{\u02c6}{r}$, the corrected distribution. To illustrate this numerically, with p = (1, 0, 0, 0), α = 0.001 and n = 10,000 then (n_{1}, n_{2}, n_{3}, n_{4}) could be (9989, 4, 3, 4) (using a multinomial distribution) whence $\stackrel{\u02c6}{q}=\left(0.9989,0.0004,0.0003,0.0004\right)$, ${M}^{1}\stackrel{\u02c6}{q}=\left(1.0019,0.0006,0.0007,0.0006\right)$, ${\left({M}^{1}\stackrel{\u02c6}{q}\right)}_{+}=\left(1.0019,0,0,0\right)$, and $\stackrel{\u02c6}{r}=\left(1,0,0,0\right)=p$. In this example the correction is exact, but this is not always the case, whence the estimate $\stackrel{\u02c6}{r}$ will still be subject to error.
To utilize this correction method, we need the matrix M; in the case of the Jukes–Cantor model this requires an estimate of the mutation rate α. This can be achieved by estimating it on a sample of known composition and genotype, denoted here the “standard” sample. This provides a twostep method to correct an empirical nucleotide distribution $\stackrel{\u02c6}{q}$ for NGS error. The first (calibration) step uses p_{ standard} and ${\stackrel{\u02c6}{q}}_{\mathrm{\u2008standard}}$ to estimate α (as $\stackrel{\u02c6}{\alpha}$, so giving an estimate $\stackrel{\u02c6}{M}$ of M) while the second (correction) step uses $\stackrel{\u02c6}{M}$ and $\stackrel{\u02c6}{q}$ to estimate p with $\stackrel{\u02c6}{r}$, as follows:

Calibration Step. This is the estimation of α, the error rate in NGS sequencing. We begin with a sample in which the nucleotide mix p_{ standard} = (p_{1},p_{2},p_{3},p_{4})^{T} is known at each nucleotide position along a genome. We then sequence the mix using NGS, giving nucleotide mix ${\stackrel{\u02c6}{q}}_{\mathrm{\u2008standard}}={\left({\stackrel{\u02c6}{q}}_{1},{\stackrel{\u02c6}{q}}_{2},{\stackrel{\u02c6}{q}}_{3},{\stackrel{\u02c6}{q}}_{4}\right)}^{T}$. For each site we can solve for $\stackrel{\u02c6}{\alpha}$ using $\stackrel{\u02c6}{q}=\stackrel{\u02c6}{M}p$. Equating ith components provides estimates ${\stackrel{\u02c6}{\alpha}}_{i}=\left({\stackrel{\u02c6}{q}}_{i}{p}_{i}\right)/\left(14{p}_{i}\right)$ for i = 1, …, 4; these can be averaged across nucleotides to give a lower variance estimate of α, (1)$\stackrel{\u02c6}{\alpha}=\frac{{\stackrel{\u02c6}{\alpha}}_{1}+{\stackrel{\u02c6}{\alpha}}_{2}+{\stackrel{\u02c6}{\alpha}}_{3}+{\stackrel{\u02c6}{\alpha}}_{4}}{4}=\frac{1}{4}\sum _{i=1}^{4}\frac{{\stackrel{\u02c6}{q}}_{i}{p}_{i}}{14{p}_{i}}.$ These estimates can be averaged across nucleotide positions to improve the estimate further.

Correction Step. Given nucleotide distributions $\stackrel{\u02c6}{q}$ produced from NGS of a sample, form the corrected counts $\stackrel{\u02c6}{r}={\left({\stackrel{\u02c6}{M}}^{1}\stackrel{\u02c6}{q}\right)}_{+}/\sum _{i=1}^{4}{\left({\left({\stackrel{\u02c6}{M}}^{1}\stackrel{\u02c6}{q}\right)}_{+}\right)}_{i}$, where $\stackrel{\u02c6}{M}=M\left(\stackrel{\u02c6}{\alpha}\right)$.
Example 1 in the Results section illustrates these calibration and correction steps with a numerical example.
It is possible to estimate an upper bound for the residual error rate remaining after correction, given an NGS error rate α and coverage n. This is done by using the calibration step with p_{ standard} = (1, 0, 0, 0) and ${\stackrel{\u02c6}{q}}_{\mathrm{\u2008standard}}=\stackrel{\u02c6}{r}$, denoting the estimated residual error rate by $\stackrel{\u02c6}{\beta}$. Here $\stackrel{\u02c6}{r}$ is found by first generating n values using q = M(α) p, then correcting using rate α. A graph of $\stackrel{\u02c6}{\beta}$ against α is given in Fig. 2, for the case p_{ standard} = (1, 0, 0, 0) and n = 1,000, 2,000, 5,000, 10,000 and 20,000. This demonstrates, for these parameters, that correction reduces the error by a factor of over 10 for a coverage of n = 1,000. That this is an upper bound is made clear in the later Discussion section.
Given a nucleotide probability mass distribution p = (p_{1}, p_{2}, p_{3}, p_{4}) the nucleotide diversity is $H=\sum _{i=1}^{4}{p}_{i}{log}_{4}{p}_{i}$. This can be thought of as the geometric mean $\prod _{i}{p}_{i}^{{p}_{i}}$ of the distribution probability masses, transformed by taking the natural logarithm of the reciprocal; then no diversity, for example (1,0,0,0), has H = 0 while complete diversity, a uniform distribution of (0.25, 0.25, 0.25, 0.25), has H = 1. Figure 3 shows the graph of a component −xlog_{4}x of diversity. The steep slope at the extremes, particularly at zero, shows that correction to a component at these extremes has most effect on diversity measurement. Of interest here is detection of clonality, where H = 0, so correction of very small probabilities is critical.
With the corrected nucleotide distribution available, the Shannon diversity can be estimated as (2)$\stackrel{\u02c6}{H}=H\left(\stackrel{\u02c6}{r}\right)=\sum _{i=1}^{4}\stackrel{\u02c6}{{r}_{i}}{log}_{4}\stackrel{\u02c6}{{r}_{i}}.$ Figure 1B schematically shows the relative levels of H(p) (true diversity), $H\left(\stackrel{\u02c6}{q}\right)$ (uncorrected diversity) and $H\left(\stackrel{\u02c6}{r}\right)$ (corrected diversity). We conclude this section by remarking that the data used for calibration should ideally be sequenced under similar conditions to the data which is to be corrected.
Diversity analysis
Two questions are key in an analysis of population diversity. Firstly, whether the diversity H is equal to a given value ℋ and secondly, whether two populations have the same diversity, or alternatively whether one is more diverse than the other. Specifically, we want to test null hypotheses, firstly of the form H_{0}: H = ℋ and secondly of the form H_{0}: H_{1} = H_{2}, for two samples labeled ‘1’ and ‘2’. We will examine both of these tests for diversity measured at a single nucleotide position and for the mean diversity across segments comprising multiple (N) possibly correlated sites.
In all cases we will have one or more true nucleotide distributions p, true post NGS nucleotide distributions q = Mp, estimates $\stackrel{\u02c6}{q}$ determined using a sample of n nucleotides drawn from q, so $\stackrel{\u02c6}{{q}_{i}}={n}_{i}/n$ for each i = 1, …, 4, where n_{i} is the number of draws resulting in the ith nucleotide and finally, corrected distributions $\stackrel{\u02c6}{r}$. Given $\stackrel{\u02c6}{r}$, $\stackrel{\u02c6}{H}=H\left(\stackrel{\u02c6}{r}\right)$, as in (2), estimates the true diversity $H=\sum _{i}{p}_{i}{log}_{4}{p}_{i}$ with H evaluated at corrected values $\stackrel{\u02c6}{r}$ of estimates $\stackrel{\u02c6}{q}$ of true values after error introduction, q. When n is large, this approximation will be close to the true value. In general, since the diversity function is concave, diversity is increased when we replace p with q and decreased when we replace q with $\stackrel{\u02c6}{q}$ (since $\stackrel{\u02c6}{H}$ is known to underestimate H (Hutcheson, 1970)) and then $\stackrel{\u02c6}{r}$ (since $\stackrel{\u02c6}{r}$ corrects $\stackrel{\u02c6}{q}$). This compensation is fortuitous.
The variance of Shannon diversity $\mathrm{Var}\left(\stackrel{\u02c6}{H}\right)$ at a nucleotide position with coverage n and empirical distribution $\stackrel{\u02c6}{r}=\left({\stackrel{\u02c6}{r}}_{1},{\stackrel{\u02c6}{r}}_{2},{\stackrel{\u02c6}{r}}_{3},{\stackrel{\u02c6}{r}}_{4}\right)$ is approximated to first order (Hutcheson, 1970) with the following expression, (3)$\mathrm{Var}\left(\stackrel{\u02c6}{H}\right)\approx \frac{1}{n}\left(\sum _{i=1}^{4}{\stackrel{\u02c6}{r}}_{i}{\left({log}_{4}{\stackrel{\u02c6}{r}}_{i}\right)}^{2}{\left(\sum _{i=1}^{4}{\stackrel{\u02c6}{r}}_{i}{log}_{4}{\stackrel{\u02c6}{r}}_{i}\right)}^{2}\right).$ This can be shown using the Δmethod and that $\mathrm{Var}\left({\stackrel{\u02c6}{r}}_{i}\right)={r}_{i}\left(1{r}_{i}\right)/n$ and $\mathrm{Cov}\left({\stackrel{\u02c6}{r}}_{i}{\stackrel{\u02c6}{r}}_{j}\right)={r}_{i}{r}_{j}/n$, where $E\left({\stackrel{\u02c6}{r}}_{i}\right)={r}_{i}$ for i = 1, …, 4.
For testing and estimation we also need the variance of the mean of diversities averaged across a segment of sites. For ease of presentation (though it is readily generalised), we now assume that (3) provides the variance of estimated diversity at each nucleotide position, denoted s^{2}. Empirical evidence shows that diversity is correlated across positions. We assume an exponential decay of the diversity correlation, with correlation between nucleotide positions i and j given by ρ^{j−i+1}. Then the average diversity across N consecutive nucleotide positions, denoted $\stackrel{\u0304}{H}$, has variance $\mathrm{Var}\left(\stackrel{\u0304}{H}\right)$ given by (4)$\mathrm{Var}\left(\stackrel{\u0304}{H}\right)\approx \frac{{s}^{2}}{N}\left(1+2\frac{N1}{N}\frac{\rho}{1\rho}\frac{2}{{N}^{2}}{\left(\frac{\rho}{1\rho}\right)}^{2}\left(1{\rho}^{N1}\right)\right)$ (5)$=\frac{{s}^{2}}{N}\left(1+\frac{2\rho}{1\rho}\right)\phantom{\rule{1em}{0ex}}\text{for}N\text{large}.$ We estimate ρ using the correlation between adjacent diversities (the onestep lagged correlation) along the segment of interest. With these variance estimates we can conduct hypothesis tests and construct confidence intervals in all cases considered. These are summarised in Table 1 and illustrated numerically in Example 2 in “Results”.
Number of samples  

1  2  
Hypothesis  Test statistic z  Hypothesis  Test statistic z 
μ_{H} = ℋ  $\frac{\stackrel{\u02c6}{H}\mathscr{H}}{\sqrt{\mathrm{Var}\left(\stackrel{\u02c6}{H}\right)}}$  μ_{H1}−μ_{H2} = 0  $\frac{{\stackrel{\u02c6}{H}}_{1}{\stackrel{\u02c6}{H}}_{2}}{\sqrt{\mathrm{Var}\left({\stackrel{\u02c6}{H}}_{1}\right)+\mathrm{Var}\left({\stackrel{\u02c6}{H}}_{2}\right)}}$ 
[$\stackrel{\u02c6}{H}\pm {z}_{0.975}\sqrt{\mathrm{Var}\left(\stackrel{\u02c6}{H}\right)}$  $\stackrel{\u02c6}{{H}_{1}}\stackrel{\u02c6}{{H}_{2}}\pm {z}_{0.975}\sqrt{\mathrm{Var}\left({\stackrel{\u02c6}{H}}_{1}\right)+\mathrm{Var}\left({\stackrel{\u02c6}{H}}_{2}\right)}$] 
Clonal threshold estimation
Suppose that a clonal sample (one with zero diversity) is sequenced by next generation methods with error rate α. This is corrected to level $\stackrel{\u02c6}{\beta}$, as shown in Fig. 2. This level of error remains in the sample, whence an upper 95% threshold for $\stackrel{\u0304}{H}$, the average diversity across N sequential positions, with lagged correlation ρ, is $\stackrel{\u0304}{H}+{z}_{0.95}\sqrt{\mathrm{Var}\left(\stackrel{\u0304}{H}\right)}$ where z_{0.95} = 1.64 is the 95th percentile of the standard normal distribution. Note that this threshold depends on both the coverage n and the segment length N. A larger n typically decreases $\stackrel{\u0304}{H}$ and so the threshold level, while a larger N decreases the width of the confidence interval around $\stackrel{\u0304}{H}$, so also decreases the threshold level.
Sequence data
Three viral samples were sequenced by next generation sequencing (Illumina HiSeq1000). These comprised a sample that could be used for estimation of the error (standard sample), and two experimental samples, from low viral load and high viral load bees. After trimming adapters and barcodes, the first reads (the reads are paired), 101nts long, were used to calculate the NGS error rate and illustrate the methods.
Standard sample: an accurately known mix of two viral recombinants, X59
This was composed of two of the DWVlike viral recombinant RNA genomes described in Moore et al. (2011), VDV1_{DV D} and VDV1_{V V D} (NCBI Accession Nos. HM067437 and HM067438 respectively). The mixture was produced by in vitro RNA transcription using linearized plasmid clones with fulllength cDNA inserts of VDV1_{DV D} and VDV1_{V V D} (Moore et al., 2011) with the T7 mMESSAGEmMACHINE kit (Ambion). The RNA transcripts were purified using RNAeasy columns (Quiagen), quantified, mixed 25% VDV1_{DV D}, 75% VDV1_{V V D} and sequenced using the Illumina platform protocol. The NGS reads are available in the EBI Sequence Read Archive, study accession PRJEB5249, ERS395188. The reads were aligned using Bowtie 2 (Langmead & Salzberg, 2012) to a single reference sequence (VDV1_{DV D}). SAMtools mpileup (Li et al., 2009) was used to produce the number of nucleotides covering each position in the reference. We excluded positions of mismatches between the VDV1_{DV D} and VDV1_{V V D} sequences, about 4% of the genome length. This produced nucleotide pileups where all diversity was a result of the methods used during NGS sequence acquisition. This dataset is used for calibration of NGS error.
A viral population mix from low DWV level (F3) and from high DWV level (E7) honeybees
Samples from honeybee nurse bees for which qRTPCR showed either low DWV levels (EBI SRA study accession PRJEB5249, ERS395182) or high DWV levels (EBI SRA study accession PRJEB5249, ERS395180) were used. In each case, in order to assemble the required nucleotide distribution data, the following computational steps were carried out. “First reads” of length 101nt were aligned using Bowtie 2 to DWV (NC_004830) and VDV1 (NC_006494) reference sequences. SAMtools mpileup was then used to produce the number of nucleotides covering each position in the references.
Deformed wing virus
A map of the deformed wing virus adapted from Lanzi et al. (2006) is shown in Fig. 4. In the Results section we select a single site in the helicase region of the genome (Examples 2.1 and 2.2), and average across the capsid region (Examples 2.3 and 2.4).
Results
Example 1. Calibration and error correction (i) Calibration—estimation of the NGS error rate α. Nucleotide pileups resulting from the X59 mix of two viral recombinants were used for calibration of the NGS error rate. Nucleotide positions where DWV and VDV1 are identical were selected; for these the true distribution is one of p = (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0) or (0, 0, 0, 1). For example, the upper half of Table 2 shows the pileup counts for the first three capsid (see Fig. 4) nucleotide sites. Since the p here place all weight on one nucleotide, Eq. (1) reduces, when p = (1, 0, 0, 0), for example, to $\left({\stackrel{\u02c6}{q}}_{2}+{\stackrel{\u02c6}{q}}_{3}+{\stackrel{\u02c6}{q}}_{4}\right)/3$ or the average error proportion. Thus, for the first position in Table 2 we would estimate α as (0.0010 + 0.0016 + 0.0008)/3 or 0.0011; for the second and third positions the corresponding averages are 0.0014 and 0.0007, giving a local average of 0.0011. This was done for the entire capsid region of the genome (2,843nts in length), yielding $\stackrel{\u02c6}{\alpha}=0.001949$, so the overall NGS error rate is estimated at a little under 2 in 1,000.
(ii) Correction of the distributions $\stackrel{\u02c6}{q}$. This was carried out in three ways.

Within the capsid region of X59, using $\stackrel{\u02c6}{\alpha}=0.001949$ calculated using the entire capsid region. Correction of the distribution $\stackrel{\u02c6}{q}$ in the first three capsid positions, shown in the upper half of Table 2, yielded the $\stackrel{\u02c6}{r}$ distributions shown; each is corrected to the exact distribution. (The Excel software available was used to also test the calibrationcorrection routine across a central region in the helicase and a nonstructural region downstream of the helicase; in both cases the degenerate distributions were found with excellent accuracy.)

Within the capsid region of E7, again using $\stackrel{\u02c6}{\alpha}=0.001949$ calculated using the entire capsid region of the independent standard sample X59. The accurate results are shown in the lower half of Table 2.

Within the capsid region of X59, using crossvalidation. The capsid region was divided into sites with odd index (for calibration) and sites with even index (for correction). The calibration calculation here gives $\stackrel{\u02c6}{\alpha}=0.001582$. Two arbitrarily chosen even indices in the correction region (where the true distribution is known to be degenerate) had associated raw nucleotide counts {n_{i}} of (71, 91461, 39, 36) (so $\stackrel{\u02c6}{q}=\left(0.0008,0.9984,0.0004,0.0004\right)$) and (134, 60, 90, 65615) (so $\stackrel{\u02c6}{q}=\left(0.0020,0.0009,0.0014,0.9957\right)$). The corrected distributions $\stackrel{\u02c6}{r}$ are (0, 1, 0, 0) and (0.0005, 0, 0, 0.9995) respectively, showing considerable improvement.
Position no.  A  C  G  T  Diversity  

1  Counts {n_{i}}  87,573  86  142  73  
Proportion $\stackrel{\u02c6}{q}$  0.9966  0.0010  0.0016  0.0008  0.0222  
Proportion $\stackrel{\u02c6}{r}$  1  0  0  0  0  
2  Counts {n_{i}}  90  130  156  86,961  
Proportion $\stackrel{\u02c6}{q}$  0.0010  0.0015  0.0018  0.9957  0.0270  
Proportion $\stackrel{\u02c6}{r}$  0  0  0  1  0  
3  Counts {n_{i}}  68  29  87,998  93  
Proportion $\stackrel{\u02c6}{q}$  0.0008  0.0003  0.9978  0.0010  0.0149  
Proportion $\stackrel{\u02c6}{r}$  0  0  1  0  0  
1  Counts {n_{i}}  13,267  11  16  11  
Proportion $\stackrel{\u02c6}{q}$  0.9971  0.0008  0.0012  0.0008  0.0161  
Proportion $\stackrel{\u02c6}{r}$  1  0  0  0  0  
2  Counts {n_{i}}  11  25  15  13,215  
Proportion $\stackrel{\u02c6}{q}$  0.0008  0.0019  0.0011  0.9962  0.0208  
Proportion $\stackrel{\u02c6}{r}$  0  0  0  1  0  
3  Counts {n_{i}}  8  9  13,440  10  
Proportion $\stackrel{\u02c6}{q}$  0.0006  0.0007  0.9980  0.0007  0.0120  
Proportion $\stackrel{\u02c6}{r}$  0  0  1  0  0 
Example 2: Diversity testing and confidence intervals
2.1 Single site, single population
The nucleotide distribution, in A, C, G, T order, at arbitrarily chosen position 5201 (within the helicase region of the viral genome, Fig. 4) in F3 is (1, 0, 58, 1) giving $\stackrel{\u02c6}{q}=\left(0.0167,0,0.9667,0.0167\right)$. Correction using estimate $\stackrel{\u02c6}{\alpha}=0.001949$ yields corrected proportions $\stackrel{\u02c6}{r}=\text{(0.0148,0,0.9704,0.0148)}$. True diversity is then estimated as $H\left(\stackrel{\u02c6}{r}\right)=0.1110$ using (2) and $\mathrm{Var}\stackrel{\u02c6}{\left(H\right)}=0.0044$ using (3), whence the test statistic z is 1.6814 with pvalue of 0.0463, leading to rejection of the null hypothesis that the true diversity ℋ = 0 at the 5% level. A 95% confidence interval for true diversity is [0, 0.2404].
Sample E7 at nucleotide position 5201 has counts of (38, 35, 26497, 48) giving $\stackrel{\u02c6}{q}=\left(0.0014,0.0013,0.9954,0.0018\right)$ and $\stackrel{\u02c6}{r}=\left(0,0,1,0\right)$. True diversity is estimated as $\stackrel{\u02c6}{H}=0$ and $\mathrm{Var}\stackrel{\u02c6}{\left(H\right)}=0$ (since (3) converges to zero in this case), whence the hypothesis of zero diversity is not rejected. A 95% confidence interval for true diversity is {0}.
2.2 Single site, two populations
We test whether the diversities are equal at nucleotide position 5201 for F3 (first sample) and E7 (second sample), using the data in Example 2.1. The test statistic z described in Table 1 is 1.6814 with pvalue = 0.0463. Hence we reject the null hypothesis of equal true diversities for the two populations at this site, concluding that F3 has higher diversity. A 95% confidence interval for the true diversity difference is [−0.0184, 0.2404].
2.3 Multiple sites, single population
We test whether the mean diversity across the capsid region (see Fig. 4) of the viral genome is zero, for both E7 and F3 (N = 2, 843). For E7, $\stackrel{\u0304}{H}=0.0037$, the serial correlation of diversities is estimated as $\stackrel{\u02c6}{\rho}=0.1728$ and mean site diversity variance is s^{2} = 7.714 × 10^{−7} whence $\mathrm{Var}\left(\stackrel{\u0304}{H}\right)=3.847\times 1{0}^{10}$ using (5), giving z = 187.5 so that the pvalue is negligible. Here the sample size is so large and hence the power so great that we conclude that the true diversity is nonzero. Of greater interest is a 95% confidence interval for the true diversity, namely [0.0036, 0.0037]. Corresponding calculations for the more varied F3 sample give values of $\stackrel{\u0304}{H}=0.0410$, $\stackrel{\u02c6}{\rho}=0.1195$, s^{2} = 0.0019, $\mathrm{Var}\left(\stackrel{\u0304}{H}\right)=8.702\times 1{0}^{7}$, z = 43.98 and again a negligible pvalue. A 95% confidence interval for the true mean diversity value is [0.0392, 0.0429].
2.4 Multiple sites, two populations
We test the null hypothesis that the true mean diversities, over the capsid region, of F3 (first sample) and E7 (second sample) are equal. Using values already given in Example 2.3, we find that z = 40.03, again giving a negligible pvalue. Hence we conclude that the mean diversity of F3 is greater than that of E7. A 95% confidence interval for the true mean diversity difference is [0.0355, 0.0392].
Example 3: Clonal threshold
Using $\stackrel{\u02c6}{\alpha}=0.001949$ and with coverage of n = 20,000 we find that $\stackrel{\u02c6}{\beta}=0.0001233$. This gives $H\left(\stackrel{\u02c6}{\beta}\right)=0.002748$, by direct calculation using $\stackrel{\u02c6}{r}=\text{}{log}_{4}{\left(13\stackrel{\u02c6}{\beta}\right)}^{13\stackrel{\u02c6}{\beta}}3{log}_{4}{\stackrel{\u02c6}{\beta}}^{\stackrel{\u02c6}{\beta}}\text{}$. Then $s=\sqrt{\mathrm{Var}\left(H\left(\stackrel{\u02c6}{\beta}\right)\right)}=0.0008940$ using (3). The 95% clonal diversity threshold for a single site is therefore 0.002748 + 1.64 × 0.0008940 or approximately 0.0042. For mean diversity, with N = 2, 843 and $\stackrel{\u02c6}{\rho}=0.1728$ (appropriate for E7 on the capsid) the 95% clonal mean diversity threshold is $H\left(\stackrel{\u02c6}{\beta}\right)+1.64\left(s/\sqrt{N}\right)\sqrt{12\stackrel{\u02c6}{\rho}/\left(1\stackrel{\u02c6}{\rho}\right)}=0.0028$.
Figure 5 illustrates the confidence intervals found in Example 2.3, labelled “After correction”, together with corresponding confidence intervals for the uncorrected data, labelled “Before correction”. The corrected clonal mean diversity threshold of Example 3 is also shown. The highly significant difference between the Varroafree and Varroainfested nucleotide diversities are evident. Correction of the low diversity sample has a far greater effect on diversity than correction of the high diversity sample, due to the steep slope of the diversity component near zero (Fig. 3). The corrected low diversity sample lies just above the clonal threshold.
Discussion
Here we have proposed a simple method for correction of nucleotide distributions containing errors arising from NGS. The correction method assumes an independent site error model. Throughout we have used the simplest error matrix M, the Jukes–Cantor matrix, with the error rate assumed to be constant across the genome and mutation parameter fitted from a standard (known genotype) sample. The method could be extended to use a richer evolutionary model, with parameters tailored to regions of the genome. For instance, the twoparameter Kimura model assumes transversion and transition errors differ, while the Generalized Kimura model adds a further parameter, allowing transversion error probabilities to differ according to transversion direction (Ewens & Grant, 2005). At an extreme, a full 12parameter model could be used, fitted in sets of three consecutive nucleotides, since four linear equations are available per position. In principle, both the statistical nature of the error mechanism of the NGS platform and errors introduced in sample generation (e.g., PCR) should be captured in the model used.
A geometric view of the correction process is shown in Fig. 6. The tetrahedron is the locus of all probability mass functions on the four points {A, C, G, T}, with the vertices corresponding to situations where the probability mass is on a single nucleotide and the diversity is zero. The centroid c = (0.25, 0.25, 0.25, 0.25) corresponds to the situation where mass is spread equally across nucleotides, and the diversity takes its maximum value of one. When the true nucleotide distribution is p (e.g., p = (1, 0, 0, 0) in Fig. 6) and the NGS error rate is α, the limiting distribution following NGS (based on sequencing a very large number of nucleotides) is q = Mp = (1−4α) p + 4αc, along the line between p and c. In practice we sequence finitely many nucleotides n, giving an empirical distribution $\stackrel{\u02c6}{q}$ close to q. The correction of $\stackrel{\u02c6}{q}$, $\stackrel{\u02c6}{r}=\left(\stackrel{\u02c6}{q}4\alpha c\right)/\left(14\alpha \right)$, reverses the p to q contraction with $\stackrel{\u02c6}{r}$ lying along the extension of the line from c to $\stackrel{\u02c6}{q}$. (If this $\stackrel{\u02c6}{r}$ lies outside the simplex, $\stackrel{\u02c6}{r}$ is taken to be the nearest point in the simplex to its projection onto the plane determined by its nonnegative components.) In general, the further p is from c, the greater the distance from p to q, so the greater the distance from $\stackrel{\u02c6}{q}$ to $\stackrel{\u02c6}{r}$; the greater the error, the greater the correction. In turn, diversity increases from zero to one along the line from a tetrahedron vertex to the centroid, and is convex down, as shown in Fig. 7, so changes more quickly when the diversity is low and is relatively stable when diversity is high. In summary,

The method corrects more when more correction is called for, and

The method corrects more where the benefit of correction is greater.
The upshot is that bias inherent in the uncorrected diversity values is rectified by the correction process.
We caution that error remains, even after correction. In the light of the geometry, worstcase remaining error occurs for p a vertex. In this case Fig. 2 provides information about the worstcase error rate $\stackrel{\u02c6}{\beta}$ remaining after correction. This decreases as α decreases (since lower α introduces lower error) and decreases as the coverage n increases (since as n increases, $\stackrel{\u02c6}{q}$ becomes closer to q).
The diversity testing and estimation stage uses a diversity correlation structure. We have assumed serial correlation of diversities, since this was seen in the samples considered. Other correlation structures could be employed, such as a structure that decays more rapidly with separation.
The method is able to correct in multigenome situations which appear to lie beyond the reach of other methods. All correction methods in the literature known to the authors are variants of alignment of reads to a reference genome and the taking of a consensus. If the original sample is a mix of two genomes then such methods are unable to correct at sites of variation (since the minor variant will generally be corrected to the major variant), whereas the proposed method estimates the error rate (using a calibration sample) and so is able to correct, even at sites of variation.
Finally, it may be possible to use the “clonal threshold” to detect SNPs or the more general SNVs. This is a topic for future research.
Conclusion
Movement of a virus to a new host can trigger changes in the level of diversity of variants of the virus. Such diversity changes can be detected using next generation sequencing of the viral mix, but error introduced in the process may mask a diversity change. We have presented a method for estimating measurementerrorcorrected diversity from NGS data, both at single nucleotide sites and along sections of a genome. A simple but effective method is used to correct for NGS errors in the nucleotide distribution. With these corrected estimates we are able to compare diversity between samples, assessing whether they are consistent with equal diversity, and also assess whether a diversity estimate is consistent with a clonal population. We demonstrated our method within the context of viral diversity in bees that either have low viral load or high viral load, showing that their viral diversity is significantly different and that under high viral load the viral population is nearclonal. Correction for NGS error was particularly important when applied to the sample with low diversity, since the diversity estimate decreased by a factor of more than five. This analysis has demonstrated that the Varroa mite has a significant effect on the DWV population in honeybees, due to the emergence of a virulent strain that is preferentially replicated after transmission by Varroa. This results in the striking reduction in viral diversity observed in Varroaexposed pupae and symptomatic worker bees. Our method is applicable to any population structure analysis using NGS data and thus adds a valuable tool to the study of selection pressure and differential fitness within populations.