PopAlu : population-scale discovery of Alu polymorphisms

Alu elements are sequences of approximately 300 basepairs that combined comprise more than 10% of the human genome. Due to their recent origin in primate evolution some Alu elements are polymorphic in humans, present in some individuals while absent in others. We present PopAlu, a tool to detect polymorphic Alu elements on a population scale from paired-end sequencing data. PopAlu uses read pair distance and orientation as well as split reads to identify the location and precise breakpoints of polymorphic Alus. Genotype calling enables us to differentiate between homozygous and heterozygous carriers, making the output of PopAlu suitable for use in downstream analyses such as genome-wide association studies (GWAS). We show on a simulated dataset that PopAlu calls Alu elements inserted and deleted with respect to a reference genome with high accuracy and high precision. Our analysis of real data of a human trio from the 1000 Genomes Project confirms that PopAlu is able to produce highly accurate genotype calls. To our knowledge, PopAlu is the first tool that identifies polymorphic Alu elements from multiple individuals simultaneously, pinpoints the precise breakpoints and calls genotypes with high accuracy.


INTRODUCTION
Population-wide identification of variation has recently become possible through falling costs of DNA sequencing.The list of whole-genome sequencing projects with large numbers of individuals is constantly growing (Gudbjartsson et al., 2015;Genome of the Netherlands Consortium et al., 2014;Siva, 2008) due to its potential to characterize genetic variation and advance medical research.
Alu elements are a substantial source of structural variation in human genomes.This class of active mobile elements (MEs) is abundant in all primate species and comprises more than 10% of the human genome (Cordaux and Batzer, 2009).The Alu sequences are approximately 300 bp long with a dimeric structure separated by a short A-rich region, each monomer being derived from the 7SL RNA gene.
Although Alu elements do not encode genes, many studies suggest their functional importance.
Alu elements are recognized to affect protein synthesis at the transcriptional and post-transcriptional level (Sorek et al., 2002;Kelley et al., 2014) as well as DNA methylation (de Andrade et al., 2011) and other cellular processes (Deininger and Batzer, 1999).Furthermore, they are thought to be major drivers of genome evolution (Hormozdiari et al., 2013;Salem et al., 2003) and assist in the creation of structural variation (Wang et al., 2006).The importance of Alu elements is further highlighted by the potential association with genetic instability, one of the principal causative factors in many disorders including cancer (Deininger and Batzer, 1999;Zhang et al., 2011;Helman et al., 2014).
Alu elements have been inserted into the human genome at more than one million locations over the last 65 million years (mya).The majority of amplifications happened early in primate evolution.The estimated current rate of Alu retrotransposition is approximately one per generation, which is at least 100-fold slower than at the peak of amplification that occurred 30 − 50 mya ago (Batzer and Deininger, 2002;Kapitonov and Jurkal, 1996;Witherspoon et al., 2013).

PrePrints
The evolutionary history of Alu elements led to two broad categories, to fixed Alu elements and polymorphic Alu elements.Fixed Alu elements are present in the entire population and, thus, are presumably evolutionarily older.Their locations are largely known from the reference genome.In contrast, polymorphic Alu elements appear only in a subset of the population and, hence, are likely the result of more recent retrotransposition events.The reference genome contains only some of these newly inserted Alus and the locations of many other polymorphic Alu elements are still unknown.
Identification of polymorphic Alu elements from sequencing data can be divided into two problems, the detection of Alu deletions and the detection of Alu insertions.The goal of the Alu deletion problem is to find Alu elements present in the reference but not in a sequenced genome.The goal of the Alu insertion problem is to find Alu elements missing in the reference but present in a sequenced genome.We acknowledge that even the Alu elements found by solving the deletion problem have most likely been inserted during evolution.
While a large number of methods have been developed to determine other types of variation from sequencing data, such as SNPs and small indels, comparatively fewer methods have been developed for finding Alu polymorphisms.Notable exceptions are Alu-Detect (David et al., 2013), VariationHunter (Hormozdiari et al., 2010, 2013) RetroSeq (Keane et al., 2013), Tangram (Wu et al., 2014), and Mobster (Thung et al., 2014).These methods focus mainly on the detection of Alu insertions, and generally follow a three-step analysis.First, they identify fragments (reads or read pairs) that indicate the occurrence of an Alu insertion.Next, they cluster these fragments along the genome, such that each cluster includes a potential insertion.Last, for each sequenced genome and at each cluster, they calculate a likelihood that an Alu element has actually been inserted given the set of fragments.
In this paper, we describe the tool PopAlu for population-wide detection of Alu polymorphisms.
PopAlu is the successor of our previous tool PAIR (Sveinbjörnsson and Halldórsson, 2012) with a number of improvements and an extension to handle many individuals simultaneously.It follows the three-step approach and starts by identifying read pairs indicating an Alu polymorphism.As opposed to most other methods, PopAlu constructs clusters in the second step using read pairs from many individuals simultaneously.Pooling of data across many individuals increases detection power even for polymorphisms of low frequency.Further, PopAlu pinpoints the precise insertion breakpoints instead of reporting only approximate locations.In the last step, PopAlu applies a probabilistic approach for calling genotypes.It can differentiate between homozygous and heterozygous calls, while many other tools either do not report heterozygous calls or make calls simply based on the counts of supporting fragments.Our implementation of PopAlu is easy to use in that it is almost parameter-free -most of the parameters are automatically inferred from the input data -and it is a stand-alone package implemented using the SeqAn C++ library (Döring et al., 2008) without any further requirements of external tools.

METHODS
The input of PopAlu is a reference genome and a binary alignment (BAM) file of paired-end sequencing reads of a donor individual (or a set of individuals).

Definitions
A read pair r has a left read r L and right read r R , which are mates to each other, denoted as mate(r L ) = r R and mate(r R ) = r L .We use r N to denote either a left or a right read when the relative position in the pair is not relevant.If r N is mapped to the reference genome, we use begin(r N ) and end(r N ) to represent its start and end position in the mapped reference genome.We say that a read is concordant if the two ends are mapped to opposite strands on the same chromosome within a close distance of each other and otherwise discordant.We define the insert length of a read pair r, measured with respect to the reference genome, as Y (r) = end(r R ) − begin(r L ).We approximate the empirical distribution of Y from all concordant read pairs, stratifying Y by sequencing library.We let E[Y ] denote the mean of Y and σ (Y ) the standard deviation.We refer to the bounds of the Alu region as breakpoints.Figure 1 shows an example Alu region bounded by a left breakpoint AL and a right breakpoint AR.

Alu deletion
Given the reference aligned sequencing data of a single individual and a set of known Alu elements in the reference genome, the objective of the Alu deletion problem is to examine each of the given Alu elements

PrePrints
for the existence of an Alu deletion.The input set of Alu elements can be determined using various tools, e.g.RepeatMasker (http://www.repeatmasker.org).
For each Alu element in the reference genome, we distinguish the two haplotypes H 0 and H 1 ; H 0 denotes the presence and H 1 denotes the absence of the Alu element with respect to the reference genome.Given these haplotypes, three autosomal genotypes are possible for an individual: homozygote Alu (G 0 ), heterozygote (G 1 ) and homozygote non-Alu (G 2 ).Our goal is to compute for each individual the relative likelihoods of the genotypes given the sequencing data.
Our algorithm considers each Alu element in the reference genome separately.Given an Alu sequence in the reference starting at position AL and ending at AR, we restrict our attention to the region containing the Alu [AL, AR] and flanking regions [FL, AL] and [AR, FR] on both sides of the Alu.We choose From the aligned sequencing data, we then select the set of concordant read pairs R such that for all r ∈ R at least one read overlaps [FL, FR].Given the set R for an Alu element, our Alu deletion algorithm has two steps.In the first steps it classifies reads in R and in the second step it computes relative likelihoods of the three genotypes.In order to distinguish these signals of the deletion haplotype H 1 from the ones that support haplotype H 0 , we classify read pairs r ∈ R into three types and remove those r from R that fulfill none of the types' criteria.In the end, we obtain a classification, C (R), that assigns each read pair a type: type(r) ∈ {I, S, A} for all r ∈ R. We use the notation r ∈ X if a read pair r is of type X. • S (Split).A read pair is of type S if one of the reads in the pair is a split read.A read is a split read if a part of it is mapped to the left of AL and the unmapped part aligns to the right of AR or a part of it is mapped to the right of AR and the unmapped part aligns to the left of AL. r 1 and r 4 in Figure 1 are examples for S reads.
When identifying split reads we realign the unmapped part of a read using the Smith Waterman algorithm (Smith and Waterman, 1981).We allow mismatches and small gaps, but require a minimal alignment score and at least 20 bp aligned on each side of the Alu.
The insert length indicates its origin from H 0 or H 1 .The read pairs r 2 , r 3 and r 6 in Figure 1 are examples for type A. Based on the classification of read pairs, C (R), we compute a relative likelihood, L, of observing the reads given each genotype, G 0 , G 1 , and G 2 .The following two paragraphs describe how we choose the likelihoods for observing each read pair given the haplotypes.Finally, we describe how we compute a joint likelihood for all read pairs given the three genotypes.
Breakpoint overlapping reads.Reads overlapping breakpoints, i.e. read pairs of types I and S, give strong evidence for Alu polymorphisms.S read pairs are most likely from an H 1 haplotype and I read pairs are most likely from an H 0 haplotype.As we are only interested in the relative likelihoods of the data given the genotypes, we fix the likelihood, L, of such read pairs given the corresponding haplotype as 1: To account for misalignment or sequencing error, we set the likelihood of observing a read pair of type I or S given the other haplotype to a parameter PE, chosen as 0.001 in our experiments: Spanning read pairs.Read pairs spanning across an Alu, i.e. read pairs of type A, have either an insert length distribution Y (r) if they come from haplotype H 0 or they align l Alu further apart if they originate from haplotype H 1 and, thus, have an insert length distribution Y (r) + l Alu .Therefore, we can derive the likelihood of observing a read pair of type A as: Joint likelihood.At a given Alu location, we assume that each read pair in the set R is independent.The likelihood of the observed read pairs given the true genotype G g , g ∈ {0, 1, 2} and the read classification C (R) is, thus, as follows: (1) where L(r|H) is given above and P(H x |G g ) is the probability of haplotype H x given genotype G g .We have P(H 0 |G 0 ) = P(H 1 |G 2 ) = 1.Assuming uniform sequencing coverage, and a read length of r , e.g., 100 bp, we use the estimate P(H 0 |G 1 ) = l Alu +2 * r l Alu +4 * r .

Alu insertion
Detecting Alu insertions is a more difficult problem than detecting Alu deletions, as potential insertion positions (breakpoints) are not known a priori.When considering multiple individuals simultaneously, it is preferable to know the precise breakpoints shared by all carriers of the Alu insertion in order to make accurate genotype calls.Therefore, we first select reads that indicate the occurrence of an Alu insertion, cluster these reads by location per individual, and then combine the clusters of multiple individuals.
Finally, we infer breakpoints for all candidate sites and insert a consensus Alu element in silico in order to apply the Alu deletion algorithm for genotype calling.

Informative reads
There are mainly two signals indicating the presence of an Alu insertion.The first is a discordant read pair where one read is mapped to a known Alu region, cf.r 5 and r 7 in Figure 2. We will refer to these discordant read pairs as D read pairs.The second signal is a split read where only the part at one side of the breakpoint aligns to the reference genome, cf.r 6 and r 8 in Figure 2. We denote these split reads as C (clipped) reads, as they are often soft-clipped in the BAM files.
Our Alu insertion algorithm uses D read pairs to identify candidate insertion sites, and C reads to pinpoint the precise breakpoints.D read pairs are widely used to infer approximate regions of Alu insertions (David et al., 2013;Keane et al., 2013;Hormozdiari et al., 2010), whereas comparatively fewer tools utilize C reads (David et al., 2013).

Insertion sites for multiple individuals
To study multiple individuals simultaneously, we first find candidate insertion sites in each individual separately.Next, we pool the sets of candidate sites from all individuals, sort them by position, and merge all overlapping regions into one region.As merging of overlapping regions can again lead to very long regions, we heuristically partition them into subregions as for a single individual.

Identifying precise breakpoints
We identify precise breakpoints shared by all polymorphism carriers in order to exclude false positive insertion sites.Due to the biological mechanisms that lead to Alu insertions, the breakpoint is often not a single position.In the data, we observe target site duplications (TSDs) and deletions, as well as Alu insertions that we can only characterize at one end, cf.Fig. 3.A TSD is a sequence of 4-25 bp repeated just before and after the Alu element (Deininger, 2011).Also common are short deletions that accompany an Alu insertion.The most difficult cases are those where not only the Alu sequence is inserted, but also some novel sequence.For such compound insertions, we are typically able to identify a breakpoint for one end only.

PrePrints
Given an Alu insertion site, we define AL as the left breakpoint if there is a C read whose left part is mapped to the reference and whose right part is soft-clipped at AL and can be aligned to an Alu sequence.
Similarly, we define AR as the right breakpoint of the Alu insertion if there is a C read whose right part is mapped to the reference and whose left part is soft-clipped at AR and can be aligned to an Alu sequence.
AL is not always equal to AR and often only one of them can be characterized, as illustrated in Fig. 3.In our implementation we allow AL and AR to differ by up to 50 bp.
Given a region (b i , e i ), we declare that the region contains an Alu insertion if at least one of AL and AR, can be unambiguously determined.Otherwise, this region is excluded from further analysis.We use the term breakpoint to describe a pair (AL, AR), or a single position AL or AR, when only one of the positions can be identified.
Ideally, all polymorphism carriers having C reads in this region will point to one single breakpoint.
However, this is often not the case as some split reads are merely sequencing and/or mapping errors, indicating false breakpoints.This problem is partially solved by using only good C reads.For example, a C read most likely indicates a true breakpoint if it has good base calling quality and can be aligned to the reference with good alignment score.Nevertheless, the remaining C reads often suggest multiple positions.
To determine the true breakpoint, we introduce a two-level voting system.In this voting system, AL input to the second level of the voting system and the location with the highest number of votes is chosen.

Determining genotype
If a breakpoint can be identified in a region (b i , e i ), we insert a consensus Alu sequence at the breakpoint in silico.This then allows us to adopt the Alu deletion algorithm for calling genotypes.

Simulation of test data
In order to assess the accuracy of our approach, we simulated two sets of sequencing data, SimDel and SimIns, based on human chromosome 21 (build 37).In SimDel, we modeled recurring Alu deletions by selecting 100 known Alu elements on the reference chromosome 21, assigning a frequency to each of them, and deleting the Alu elements at these frequencies from 200 copies of chromosome 21, which resulted in 200 haplotypes.In SimIns, we modeled recurring Alu insertions by first deleting 100 known Alu elements from the reference chromosome 21, assigning a frequency to each of them, and inserting the Alu elements at these frequencies back into 200 copies of the modified reference, which again resulted in 200 haplotypes.In SimIns, we used the modified reference that has all 100 Alu elements deleted in all further steps of the analysis.In both sets, we chose the frequencies of the Alu element deletions or insertions to be uniformly distributed between 0 and 1.The Alu locations were chosen randomly on the chromosome, with the constraint that no other Alu is found within 600 bp of the inserted Alu.
Of every simulated Alu polymorphism, each individual can have 0, 1 and 2 copies, corresponding to the genotypes G 0 , G 1 and G 2 , respectively.G 1 represents a heterozygote and G 2 a homozygote Alu carrier.
In our deletion data set SimDel, there are an average of 36.53 heterozygotes and 34.24 homozygotes per individual, and in the insertion data set SimIns, there are 37.53 heterozygotes and 33.66 homozygotes.
See Table 1 for more details on the simulated Alu counts.
From the Mason read simulation package, version 2.0 (Holtgrewe, 2010), we used the MasonVariator to add SNPs and small indels to each haplotype and the MasonSimulator to generate paired-end Illumina reads for each haplotype (see Fig. 4 for details).We merged haplotypes in pairs to obtain 100 diploid data sets for SimDel and 100 diploid data sets for SimIns.For each SimDel and SimIns we simulated two read sets, one at an average coverage of ∼10× and another at ∼25×.

Evaluation metrics
The advantage of a simulated data set is that we can measure accuracy by comparing the predicted genotype calls to the truth, including the accuracy of distinguishing heterozygous and homozygous calls.
We count predictions per group

RESULTS
In this section, we present our results of running PopAlu on both simulated data and on a human trio from the 1000 Genomes Project.We compare the results of PopAlu to RetroSeq, a tool specialized for discovery of transposable element insertions that has been shown to be superior to the programs Tangram and Tea (Keane et al., 2013).

Simulated data
We ran PopAlu on SimDel independently for each individual, and on SimIns jointly for multiple individuals.
Since RetroSeq does not report deletions, we ran it only on SimIns.As expected, the sensitivity of PopAlu increases on SimDel from 85.8% to 98.1% for the higher coverage data as more reads provide more information on the Alu polymorphisms.We observe a similar effect for on SimIns, although it is less pronounced than for SimDel.PopAlu performs consistently better than RetroSeq, as measured by sensitivity and FDR, with a much higher genotype calling accuracy.

Real data from a 1000 genome project trio
We ran PopAlu with default parameters on public data of a CEU trio from the 1000 genome project (father NA12891, mother NA12892 and daughter NA12878).Within a follow-up study of the 1000 genomes pilot project (Stewart et al., 2011), the trio was sequenced at 9 − 16× coverage and 186 Alu insertion loci

PrePrints
were randomly selected for PCR validation.We used high depth (> 75×) Illumina HiSeq data generated at the Broad Institute 1 , the same data set used for assessing RetroSeq.We compare our results to the set of PCR validated Alu polymorphisms (Stewart et al., 2011) and to results previously reported for RetroSeq (Keane et al., 2013).(Stewart et al., 2011) for each sample and the PCR columns the number of validated calls that are also predicted by PopAlu or RetroSeq.The distance(bp) column shows the average distance in basepairs between the predicted breakpoint and the true breakpoints reported by PCR.

As shown in
Further, PopAlu and RetroSeq identify almost the same number of validated Alu insertions.Based on these numbers, the RetroSeq authors report an average sensitivity of 98% for each sample.However, the true number of novel Alu insertions is unknown, so the sensitivity may be inflated as it is presumably based on only a subset of all insertions.We further examined the accuracies in pinpointing the exact breakpoints.The mean distance from the true breakpoint is about 2 bp for PopAlu and about 17 bp for RetroSeq (see Table 3).
Next, we compared our genotype calls with the PCR validated calls (see  (Stewart et al., 2011).
Finally, we counted non-Mendelian calls within the trio to estimate a lower bound on the FDR.
Non-Mendelian calls are Alu insertions calls in the child that do not follow the expected inheritance patterns according to the calls in the parents.We include calls private to the child as false positives, since the rate of de novo Alu insertions is estimated to be about 1 per generation only (Batzer and Deininger, 2002).In the call set of PopAlu, we find in total 13 non-Mendelian calls, providing a lower bound to the FDR of 0.7%.

Timing
We ran all computations on a desktop machine of a single 2.67 GHz Intel i5 processor.On average, the detection of Alu insertions with PopAlu took about 3 hours per sample from the CEU trio family.

DISCUSSION
We have presented PopAlu, a method to detect and genotype polymorphic Alu insertions.The method can detect polymorphisms where the Alu element is present in the reference as well as where it is not.Our results indicate that the method has comparable or higher sensitivity to other tools and it can accurately distinguish between homozygous and heterozygous carriers.Further the evaluation suggests that the false discovery rate of PopAlu is low and that the precision in determining breakpoints is high.
Despite the positive results achieved, PopAlu can be improved and extended.First, the Alu sequence used in the Alu insertion genotyping algorithm is a representative Alu sequence; a better sequence may be determined by local assembly.Second, our algorithm could additionally incorporate the length distribution of I and S reads.Third, PopAlu currently uses a greedy algorithm for finding insertion locations, it may benefit from the development of an optimal algorithm.In terms of extensions, PopAlu can be extended to find other types of retrotransposons, e.g.LINE elements.Further, we note that Alu elements are often inserted along with more sequence, which may possibly be detected by combining PopAlu with a local assembly approach, such as PopIns (Kehr et al., 2015).Finally, another future extension is the inclusion of somatic Alu insertion events.

Figure 1 .
Figure 1.Example read alignments at an Alu deletion site.Arrows show read directions.The blue part of the reads can be mapped to the reference outside of the Alu and the red part can be mapped to the Alu.(A) shows example reads from haplotype H 1 .(B) shows example reads from haplotype H 0 .A heterozygote diploid can have reads shown in both (A) and (B).

Figure 2 .
Figure 2. Example read alignment at an Alu insertion site.Arrows show read directions.The blue part of the reads can be mapped to the reference and the red parts are clipped or mapped somewhere else in the reference.(A) shows example reads from the non-Alu haplotype.(B) shows example reads from the Alu insertion haplotype.A heterozygote diploid can have reads shown in both (A) and (B).
and AR are voted for independently of each other.Each region (b i , e i ) is processed as follows: (1) C reads from all individuals in this region are extracted and remapped to the reference using a split-mapping algorithm(Emde et al., 2012), allowing part of the read mapped to the reference, the other part mapped to an Alu consensus sequence.The position where the mapping jumps from reference to Alu consensus sequence indicates the breakpoint.(2) The C reads of each individual are considered separately by the first level of the voting system.If the split alignment is successful, a C read adds its vote either for the respective left or right breakpoint position.For each individual the most common breakpoint position is determined and used in the next step.(3) The determined breakpoints from all individuals are collected as C t p , where t ∈ {0, 1, 2} indicates the true genotype and p ∈ {0, 1, 2} indicates the predicted genotype.Thus, C t p specifies the number of G p predictions where the true underlying genotype is G t .For example, C 01 and C 02 count the number of false positives.We define the number of true positive calls (TPN) to tolerate genotyping errors, i. e., T PN = C 11 +C 22 +C 12 +C 21 , and calculate the sensitivity and false discovery rate (FDR) as Sensitivity = T PN T PN +C 10 +C 20 and FDR = C 01 +C 02 T PN +C 01 +C 02 .

Figure 4 .PrePrintsFigureFigure 1 PrePrintsFigureFigure 2 PrePrintsFigureFigure 3
Figure 4.The commands used to simulate sequencing reads for 200 haplotypes of our SimIns data set.The input file chr21 del.fa contains human chromosome 21 with 100 Alu elements deleted, and the input file hap.vcf contains insertion records for a subset of these regions (see text for details).<SEED>was chosen between 0 and 199.

Table 1 .
Simulated Alu counts of 100 individuals.The sum column is the total counts of simulated Alu, the min and max column are the minimum and maximum number of Alu elements seen in one simulated individual.

Table 2 .
Summary of predicted Alu counts.C t p represents the number of polymorphic Alu predicted as of genotype p while the true underlying genotype is t.The counts of C t p are further grouped into 4 types, named as TP (True Positive), FN (False Negative), FP (False Positive) and GE (Genotype-calling Error).The definitions of Sensitivity and False Discovery Rate (FDR) are given in the main text.

Table 3 ,
PopAlu reports about 35% more Alu insertions than RetroSeq.On average PopAlu identifies 1426 Alu insertions per sample, while RetroSeq reports on average 1054.This is consistent with our results on simulated data, where PopAlu is more powerful than RetroSeq.

Table 3 .
Predicted and validated Alu Insertion calls for the CEU trio.The PCR(total) column provides the total number of PCR validated Alu insertion calls by

Table 4 )
. PopAlu has an average genotype calling accuracy of 99.1%, compared to an average of 72.6% for RetroSeq.The numbers show that RetroSeq has difficulties in distinguishing between homozygous and heterozygous carriers.

Table 4 .
Genotype calls of PCR validated Alu insertion calls for the CEU trio.C t p represents the number of polymorphic Alus predicted as of genotype p while the true underlying genotype is t.The true genotype was determined by PCR validation