PEER-REVIEWED
Note that a Preprint of this article also exists, first published September 15, 2016.

## Materials and Methods

### Implementation

Atropos is developed in Python (3.3+) and is available to install from GitHub or via one of several package managers (see Data Availability).

#### Semi-global alignment

We summarize the algorithm here; see Martin (2013), Section 2.2 for details. Let a and r be the nucleotide sequences of the adapter and sequencing read, respectively, and let m = |a|, n = |r|. Adapter alignment computes edit distances D(i,j) between the i-length prefix of a and the j-length prefix of r for all i = 0,…,m and j = 0,…,n with the standard dynamic-programming (DP) recurrence $D\left(i,j\right)=min\left\{D\left(i-1,j-1\right)+\left[{a}_{i}\ne {r}_{j}\right],D\left(i-1,j\right),D\left(i,j-1\right)\right\}$

The base cases are D(i,0) = 0 or D(i,0) = i and D(0,j) = 0 or D(0,j) = j, depending on the adapter type, allowing to skip a prefix of a and/or r at no cost. The algorithm additionally keeps track of M(i, j), which is the number of matches between the prefixes of a and r, and of the “origin” O(i, j), which is the number of skipped characters in r in the optimal alignment (if negative, characters in a are skipped instead). All three DP matrices D, M, O are filled in at the same time, after which the cells of the bottom row (i = m) are inspected. They represent possible end positions of the adapter sequence within the read. For each position j, the error rate is computed from D(m, j) and O(m, j), and positions with a too high error rate are discarded. If positions remain, the one with the highest number of matches M(m, j) is returned as the position J of the adapter sequence. Together with the start of the adapter sequence at O(m, J), the adapter sequence can then be removed from the read.

Observing that no backtrace within the DP matrix is required, the actual implementation keeps only a single column of the matrices in memory for better cache locality. Significant runtime improvements are achieved by employing the optimization described by Ukkonen (1985) of stopping the computation of a column as soon as the costs are too high and provably cannot decrease for the remainder of the column. When the user supplies an anchored adapter and disables insertions and deletions (indels) at the same time, the algorithm also switches to a much simpler variant that computes only the Hamming distance between the adapter and a prefix or suffix of the read.

#### Insert match algorithm

For each read pair, the insert-match algorithm uses the same semi-global alignment algorithm described above (with indels disabled) to find all possible alignments between the first read and the reverse complement of the second read that satisfy specificity thresholds (Fig. 1C). Specificity is determined by the combination of up to three user-configurable thresholds: (1) minimum number of overlapping bases; (2) maximum number of mismatch bases; and (3) random mismatch probability (Sturm, Schroeder & Bauer, 2016). The probability of a random match at k bases out of the n bases being compared is computed using the binomial distribution: $P=\sum _{i=k}^{n}\frac{n!}{i!\left(n-i\right)!}{p}^{i}{\left(1-p\right)}^{n-i}$

Optionally, the overlapping inserts can be used for mutual error correction (Fig. 1D). Where the aligned inserts have mismatches, the base with the highest quality score is chosen as the consensus. When the bases have equal quality, there is an option to leave the bases unchanged, convert them both to N, or to choose the base from the read with the highest mean quality as the consensus. There are additional options to (1) completely overwrite one read in the pair if its quality is very poor; and/or (2) merge the overlapping read pair into a single read, which avoids double-counting overlapping read pairs in read depth-based analyses.

If no insert match is found, or if an adapter is not found in an overhang, then an unconstrained adapter-match approach is attempted separately in each read (Fig. 1E).

#### Parallel processing

The performance improvements in Atropos relative to Cutadapt and other read trimming tools are based in two observations: (1) each read (or read pair) is trimmed separately, and thus trimming can be parallelized across multiple processor cores; and (2) a significant fraction of the execution time is spent decompressing input files and re-compressing results. Compression of sequencing data is increasingly becoming necessary due to the large volumes of data generated in sequencing experiments.

$X\left(S\right)=-\sum \frac{C\left(i,S\right)\cdot \text{log}\left(C\left(i,S\right)\right)}{\text{log}\left(2\right)}$

Sequences with X(S) < 1.0 are defined as low-complexity. All remaining k-mers are counted, and each k-mer is linked to all of the sequences from which it originated. This process continues iteratively for increasing values of k, with only those read sequences linked to high-abundance k-mers in the previous iteration being used to build the k-mer profile in the next iteration. k-mer K is considered high-abundance when: $|K|>\frac{N\cdot \left(l-k+1\right)\cdot O}{{4}^{k}}$ where l is the read length and O = 100 by default. Finally, high-abundance k-mers of all lengths are merged to eliminate shorter sequences that are fully contained in longer sequences.

Atropos reports to the user an ordered list of up to 20 of the most likely contaminants. Because adapter sequences have been designed not to match any known sequence in nature, a sequence (or pair of sequences) that occurs at high frequency and matches a known adapter sequence is likely to be the true sequence(s) used as adapters in the dataset. Thus, our algorithm optionally matches the high-abundance k-mers to a list of known adapters/contaminants. We provide a list of commonly used adapter sequences, or the user can choose to supply their own. When a contaminant list is not provided, or when the adapter does not match a known sequence, we advise the user to take caution when using the results of this detection process, as a highly abundant sequence might simply be derived from a frequently repeated element in the genome.

#### Error rate estimation

Quality and adapter trimming is sensitive to the choice of several parameters. For example, relative to datasets with typical rates of sequencing error, datasets with higher error-rates require higher thresholds for mismatches and/or random-match probability during insert- and adapter-matching to perform with the same level of sensitivity. Thus, we implemented in Atropos a command that provides an estimate of the error rate in each input file. The error command gives the choice between two algorithms: (1) averaging all base qualities across a sample of reads, which is fast but likely overestimates the true rate of sequencing error (Dohm et al., 2008; DePristo et al., 2011); and (2) the shadow regression method proposed by Wang et al. (2012), which more accurately estimates error rates at the cost of reduced speed and greater memory usage.

#### Quality control metrics

Examination of QC metrics is another important aspect of sequence analysis pipeline. For example, the widely used FastQC (Andrews, 2010) tool generates statistics such as per-sequence and per-base quality scores and GC content, sequence length distribution, sequence duplication levels, and frequency of potential contaminants. QC is commonly performed both before and after read trimming to identify any systematic data quality issues, to observe the improvements in data quality due to trimming, and to ensure that trimming does not introduce any unintended side-effects. Since both read trimming and QC involve iterating over all reads in the dataset, we reasoned that implementing both operations in the same tool would reduce the overall processing time, and also eliminate the need to install two separate tools. Thus, we implemented an option in Atropos to collect QC metrics before and/or after trimming.

Additionally, we implemented an Atropos module for MultiQC (Ewels et al., 2016), a program that generates nicely formatted reports from metrics output by a variety of bioinformatics tools for one to many samples. Given summary files generated by Atropos (one per sample, in JSON format), the MultiQC module will generate interactive versions of the same static plots offered by FastQC, as well as a summary table of the most important metrics.

#### Shared Cutadapt and Atropos improvements

In addition to improvements in the semi-global alignment algorithm above, Atropos also benefits from the following improvements that were made to Cutadapt subsequent to the publication of Martin (2011), but prior to the Atropos fork, and are therefore features in both programs.

• Adapters can now be anchored, which limits the read positions at which they will be matched. An anchored 5′ adapter thus matches only if it is a prefix of the read, and a 3′ adapter only if it is a suffix of the read. This is useful, for example, when one or both sequencing adapters are known to be ligated directly to a PCR primer.

• IUPAC ambiguity codes are fully supported. Thus, adapter sequences containing characters such as N (matching any nucleotide), H (A, C, or T), Y (C or T) work as expected. They are useful when adapters contain barcodes or random nucleotides. The nucleotides and ambiguity codes are internally represented as patterns of four bits, in which each set bit corresponds to an allowed nucleotide. Comparisons are thus simple “binary and” operations, resulting in no runtime overhead.

• Paired-end data can be trimmed with sequences specified for the forward and reverse reads independently. Read pairs are guaranteed to remain in sync. Even interleaved data (paired-end reads in a single file) is accepted.

• Quality trimming can now work in a NextSeq-specific mode in which spurious runs of high-quality G nucleotides at the 3′ end of a read are correctly trimmed. NextSeq instruments use “dark” or “black” cycles for G nucleotides, making them unable to distinguish between regular G and reaching the end of the fragment.

• Other additions include support for trimming a fixed number of bases from a read, support for files compressed using the bzip2 and lzma algorithms, and improved filtering options.

### Benchmarks

#### Simulated data

We evaluated both the speed and the accuracy of Atropos relative to other state-of-the-art read trimming tools using both simulated and real-world data (Table 1). As trimming of single-end reads is unchanged from the original Cutadapt method and is also decreasing in relevance as most current experiments use paired-end data, we focused our benchmarking on trimming of paired-end reads. Sturm et al. demonstrate that Skewer (Jiang et al., 2014) and SeqPurge (Sturm, Schroeder & Bauer, 2016) stand out as having superior performance in paired-end read trimming, and Schubert et al. also demonstrate exceptional performance of AdapterRemoval (Schubert, Lindgreen & Orlando, 2016); thus, we chose to benchmark Atropos against these tools. We also compared the new insert-match algorithm against the adapter-match algorithm that is used by Cutadapt, and which can be enabled in Atropos using the “--aligner” command line option.

To simulate paired-end read data, we use the ART simulator (Huang et al., 2012) that was modified by Jiang et al. to add adapter sequences to the ends of simulated fragments. ART simulates reads based on empirically derived quality profiles specific to each sequencing platform. A quality profile consists of distributions of quality scores for each nucleotide at each read position, expressed as read counts aggregated from multiple sequencing experiments, where quality scores are in Phred scale (−10log10(e), where e is the probability that the base call is erroneous). We developed an R script to artificially inflate the error rates in an ART profile to a user-defined level. For each row in the profile with quality score bins e1..en and corresponding read counts r1..rn, the overall error rate can be computed as: $E=\frac{{\sum }_{i=1}^{n}{e}_{i}{r}_{i}}{{\sum }_{i=1}^{n}{r}_{i}}$

We use the R function optim with the variable metric (“BFGS”) algorithm to optimize a function that adds an equal number of counts C to each Phred-score bin in the distribution and then compares the overall error rate to the user-desired error rate E′: $f\left(C,{E}^{\prime }\right)=\frac{{\sum }_{i=1}^{n}{e}_{i}\left({r}_{i}+C\right)}{{\sum }_{i=1}^{n}\left({r}_{i}+C\right)}-{E}^{\prime }$

We simulated ∼800k 125 bp paired-end reads using the Illumina 2500 profile at error rates that were low/typical (∼0.2%, the unmodified profile), intermediate (∼0.6%), and high (∼1.2%). We evaluated the accuracy of the tools on the simulated data by comparing each trimmed read pair to the known template sequence. We counted the frequency of following outcomes: the fragment does not contain adapters but is trimmed anyway (“wrongly trimmed”), or the fragment does contain adapters but either too few bases or too many bases were removed (“under-trimmed” or “over-trimmed”). We also counted the total number of under- and over-trimmed bases.

#### Real data

We also benchmarked the tools on two real-world datasets. First, we sampled ∼1 M read pairs from a whole-genome bisulfite sequencing (WGBS) library generated from the GM12878 cell line. Second, we used 6.1 M paired-end mRNA-seq reads generated from the K562 cell line. Both of these datasets were generated by the ENCODE project (ENCODE Project Consortium, 2012). Since the genomic origins of the templates are not known a priori, we instead compared the read trimming tools based on improvement in the results of mapping the trimmed versus untrimmed reads. We used STAR (Dobin et al., 2013) to map the mRNA-seq reads to GRCh38, and we used bwa-meth (Pedersen et al., 2014) to map the WGBS reads to the bisulfite-converted GRCh38. We also compared the results of only adapter trimming to the results of adapter trimming plus additional quality trimming using a minimum quality threshold of 20 (Phred-scale).

One characteristic of the mRNA-Seq dataset is that average read 2 quality is substantially lower than read 1 (estimation by the “atropos error” subcommand: 6.7% versus 2.0%). In practice, when encountering a read pair in which one end is of much lower quality than the other, the Skewer algorithm essentially overwrites the former with the later, leading to more precise alignment. Atropos provides a specific option for this case (“--w”), which we make use of in our benchmark in order to fairly compare Atropos with Skewer. However, this gives these tools a perhaps unfair advantage over AdapterRemoval and SeqPurge which do not have such an option.

#### Computing environments

Although sequence analysis is sometimes performed using a desktop computer, analysis of the volumes of data currently being generated increasingly requires the use of high-performance computing facilities (“clusters”). The hardware architecture of a cluster is often different from that of a desktop computer. Most importantly, storage in a cluster is typically centralized and accessed by the compute nodes via high-speed networking. Such an architecture inevitably adds latency to file reading and writing operations (“I/O”). Cluster nodes also typically have more processing cores and memory available than desktop computers. This means that the performance of software with intensive I/O usage (such as read trimming) is likely to be quite different on a desktop versus a cluster. To examine the impact of these architectural differences, we ran the benchmarks for simulated data on both a desktop computer (a Mac Pro) having a 3.7 GHz quad-core Intel Xeon E5 processor and 32 GB RAM, and on a cluster node having 64 2.4 GHz Intel Xeon E5 cores and 256 GB memory, and with all data being read from and written to network-accessible storage over a 1 Gbit ethernet connection.

#### Reproducibility and reusability

With increasing importance being placed on both the reproducibility of results in scientific publications and the reusability of software and pipelines, we endeavored to provide a benchmark workflow that can be easily executed and extended by anyone with access to modern computing resources.

First, we “containerized” all of the software tools used in this paper—including trimming tools, read mapping tools, and supplementary tools used to evaluate results and generate tables and figures (Table S1). We also created minimal containers for all of the data used in this paper—including benchmark datasets, reference genomes, annotation databases, and indexes used by the mapping tools. Specifically, we created Docker (Boettiger, 2015) image specifications (“Dockerfiles”), generated the images, and uploaded them to a public repository on the Docker Hub (see Data Availability).

Second, we implemented our benchmark workflow using the Nextflow (Di Tommaso et al., 2017) framework. Importantly, Nextflow enables workflows to be run either locally or in most cluster environments, and supports running containerized software via either a Docker or Singularity (Kurtzer, 2016) client (depending on the operating system).

Instructions for running our workflow, along with all of the source scripts, are available in our GitHub repository (see Data Availability).

## Results

### Simulated data

#### Performance

On a desktop computer with four processing cores, we found that AdapterRemoval had the fastest overall execution time, followed closely by SeqPurge, Atropos (in parallel write mode), and Skewer (Fig. 2A; Table S2).

As expected, execution times on a cluster node using four threads were approximately 20% greater than those observed on a desktop computer (Fig. 2B; Table S3). We expect that much of this disparity is due to the increased latency involved in network-based I/O on the cluster, although some may also be explained by CPU differences (3.7 GHz Intel on the desktop versus 2.4 GHz on the cluster node).

When increasing the number of parallel execution threads from 4 to 8 and 16, Atropos achieves a roughly linear decrease in execution time. Interestingly, the execution times of AdapterRemoval, SeqPurge, and Skewer do not substantially decrease when increasing the number of the threads from 8 to 16. With 8 and 16 threads, Atropos using the adapter-match algorithm in parallel-write mode is the fastest of the tools, and with 16 threads Atropos using the insert-match algorithm in parallel-write mode is also faster than the other three tools (Table S3).

Atropos uses substantially more memory than the other tools (Fig. S1; Table S4). We expect this is partially due to overhead of automatic memory management in Python compared to C++ (in which AdapterRemoval, SeqPurge, Skewer are implemented), but in larger part results from Atropos’ use of in-memory queues to communicate between parallel processes. For all four programs, memory usage increases slightly with increasing number of threads. We note that Atropos provides parameters to limit memory usage (although typically at the expense of reduced speed).

For most datasets and thread counts, Atropos and Skewer typically achieve the highest mean CPU utilization, indicating that they are less I/O-bound than AdapterRemoval or SeqPurge (Fig. S2).

#### Accuracy

In terms of numbers of over- and under-trimmed bases, the Atropos insert-match algorithm and SeqPurge clearly had the best performance (Table 2) at all sequencing error rates. The two algorithms had similarly low numbers of under-trimmed bases, but the Atropos insert-match algorithm had a lower number of over-trimmed bases, giving it the lowest overall error rate (0.0002%). On the other hand, Skewer and the Atropos adapter-match algorithm left substantial numbers of under-trimmed bases while AdapterRemoval was again biased toward over-trimming.

Additionally, we found that all tools discarded very similar numbers of reads (1.8%) that were below the minimum length threshold of 25 bp after trimming. These were reads with short insert sizes, which have a high rate of spurious mapping, and thus it is common practice to discard them.

### Real data

#### Performance

We performed adapter trimming on the real datasets in the same cluster environment. Again, we found that AdapterRemoval had the fastest execution time (Fig. 3; Tables S5 and S6). When trimming the WGBS data with 16 threads, Atropos (using the insert-match algorithm in parallel-write mode) was nearly as fast as AdapterRemoval (Fig. 3A; Table S5), while on the mRNA-Seq data Skewer, SeqPurge, and Atropos were all 30%–50% slower than AdapterRemoval (Fig. 3B; Table S6).

We also performed read mapping on the cluster with 16 cores. Mapping times were very similar for all algorithms on both the WGBS and mRNA-Seq datasets, and were much faster than for the untrimmed reads (Fig. S3).

#### Effectiveness

We assessed read trimming effectiveness in practical terms. For the WGBS data, we computed the number of trimmed reads mapped at a given quality (MAPQ) cutoff, relative to the number of untrimmed reads mapped at that cutoff. We found that trimming by Atropos resulted in the greatest increase in number of mapped reads at all quality cutoffs (Fig. 4A). Trimming with SeqPurge, Skewer, and AdapterRemoval resulted in similar, but smaller, gains in mapping quality. At the highest MAPQ thresholds (45, 50, 55), Atropos substantially outperforms the other three tools.

We also found that additional quality trimming in addition to adapter trimming has a substantial negative effect on read mapping, at least for bisulfite reads mapped using bwa-meth (Fig. 4B). Quality trimming by Skewer had the least negative effect on mapping quality of the four programs, and quality trimming by AdapterRemoval had the greatest negative effect on mapping quality.

For the mRNA-seq data, we additionally compared each alignment to GENCODE (v26) gene annotations (Harrow et al., 2012) to determine the number of reads mapped to expressed regions of the genome. We found that trimming with Atropos resulted a greater number of mapped reads aligned to expressed regions compared to the other tools at all MAPQ thresholds (Fig. 5).

## Conclusion

Our results demonstrate that adapter trimming tools are approaching optimal accuracy, at least for the (currently) most common type of data—paired-end short reads with 3′ adapters. On synthetic data with varying error rates, Atropos (using our new insert-match algorithm) and SeqPurge both demonstrated overall error rates of 0.01% at the read level, and Atropos has the lowest base-level error rate of 0.0002%.

On real WGBS and mRNA-seq data, we found that adapter trimming with Atropos resulted in the greatest increase in read mapping quality. We also found that stringent quality trimming has a negative effect on WGBS read mapping quality, at least when using bwa-meth as the alignment tool. For reads trimmed with a quality threshold of 20, all mapping statistics were worse than those for untrimmed reads.

In terms of performance, AdapterRemoval and SeqPurge had the best performance of the four tools tested when only four threads were available, while Atropos had superior performance on the simulated datasets and competitive performance on the real datasets when there were at least eight threads available. Of the three write modes, Atropos performed best in parallel-write mode. However, parallel-write mode has the trade-off of producing a larger number of data files, which may make analyses of large projects more complicated to manage. Atropos’ memory requirements were the highest among the four programs (3–4 GB versus 0.5–1.5 Gb), but well within the capabilities of most modern computer systems.

In summary, our results show that Atropos offers the best combination of accuracy and performance of the tools that we evaluated. Furthermore, Atropos has the richest feature set of the four tools, including Methyl-Seq-specific trimming options, automated adapter detection, estimation of sequencing error, computation of quality-control metrics before and after trimming, and support for data generated by many sequencing methods (ABI SOLiD, Illumina NextSeq, mate-pair libraries, and single-end sequencing). Although we have not optimized Atropos for long-read data (e.g., PacBio and Nanopore), it should work on those datasets given appropriate parameter settings, and we plan to soon provide explicit long-read support.