This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ PrePrints) and either DOI or URL of the article must be cited.
We present a semi-streaming algorithm for k-mer spectral analysis of DNA sequencing reads, together with a derivative approach that is fully streaming. The approach can also be applied to genomic, transcriptomic, and metagenomic data sets. We develop two tools for short-read analysis based on these approaches, a method for semi-streaming k-mer-based error trimming, and a method for the analysis of error profiles in short reads using a streaming sublinear approach. These tools are implemented in the khmer software package, which is freely available under the BSD License at github.com/ged-lab/khmer/.
This is a submission to PeerJ Computer Science for review.