Syncmers are more sensitive than minimizers for selecting conserved k‑mers in biological sequences

Bioinformatics and Genomics


Introduction

K-mers, submers and minimizers

Submer conservation

Submer distance distribution

Methods

Coding function

Minimizers

Definition

Compression factor

Mincode submers

Definition

Distance distribution

Modulo submers

Definition

Closed syncmers

Definition

Window length

Compression factor

This approximation is reasonable when 4^s ≫ k, but breaks down when 4^s ≲ k. For example, if s = 1, then a k-mer is a closed syncmer if the first or last letter has lowest code value (call it A), which occurs with probability 1/4 (first letter is A) + 1/4 (last letter is A) − 1/16 (subtract double-counting when both first and last letters are A) = 7/16 ≈ 0.44 for any k > 1, and this is an under-estimate because there are additional closed syncmers containing no As. Thus, with s = 1 the compression factor is <16/7 ≈ 2.29 for any k.
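The s = 1 argument can be checked by brute force. The following is a minimal sketch, not the author's code: it assumes lexicographic coding of s-mers and treats a k-mer as a closed syncmer whenever its first or last s-mer attains the minimum (ties included). The function names are hypothetical.

```python
from itertools import product

def is_closed_syncmer(kmer, s):
    """True if the first or last s-mer of kmer attains the smallest code.

    Assumption: lexicographic coding; a tie with the minimum still counts.
    """
    smers = [kmer[i:i + s] for i in range(len(kmer) - s + 1)]
    smallest = min(smers)
    return smers[0] == smallest or smers[-1] == smallest

def closed_syncmer_frequency(k, s):
    """Fraction of all 4^k k-mers that are closed syncmers (exhaustive count)."""
    hits = sum(is_closed_syncmer("".join(p), s)
               for p in product("ACGT", repeat=k))
    return hits / 4 ** k

# Probability that the first or last letter is A, as in the text.
first_or_last_is_a = 1 / 4 + 1 / 4 - 1 / 16

freq = closed_syncmer_frequency(k=6, s=1)
print(f"frequency = {freq:.4f}, compression = {1 / freq:.2f}")
print(f"first-or-last-is-A probability = {first_or_last_is_a:.4f}")
```

Under these assumptions the enumerated frequency exceeds the first-or-last-is-A probability, because k-mers with no A at all can still have their smallest letter at an end.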

Maximum distance

Efficient spacing

Open syncmers

Definition

Compression factor

Spacing

Circular syncmers

Circular syncmers consider a k-mer sequence to wrap around and thereby to contain k distinct s-mers rather than k − s + 1. Non-circular syncmers are described as linear if needed to distinguish these two types. For example, the 5-mer ACGTA contains 2-mers AC, CG, GT and TA if considered to be linear, plus AA if circular. With lexicographic coding, ACGTA is a (k = 5, s = 2) linear open syncmer because the first 2-mer AC is its smallest, but not a circular open syncmer because the smallest 2-mer (AA) is not the first. By increasing the number of s-mers, circularity enables higher compression for given k and s. For example, the compression factor is ~(k − s + 1) for linear open syncmers, which increases to ~k for circular open syncmers (by similar reasoning to Eq. (3)). Circular open syncmers do not have a window guarantee, as shown by the following counter-example with k = 4 and s = 2. The repeating string ACCACCACCA… contains three distinct 4-mers ACCA, CCAC and CACC where the position of the smallest 2-mer is 4, 3 and 2 respectively. None of them has its smallest 2-mer at position 1, so the string contains no circular open syncmers at all. For circular closed syncmers, I do not have a proof of a window guarantee or a counter-example.
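The linear and circular definitions can be compared on the examples above with a short sketch. It assumes lexicographic coding, open syncmers without an offset (smallest s-mer at position 1), and first-occurrence tie-breaking. The helper names are illustrative and are not taken from the paper's source code.

```python
def smers(kmer, s, circular=False):
    """List the s-mers of kmer, wrapping around the end if circular."""
    k = len(kmer)
    if circular:
        wrapped = kmer + kmer[:s - 1]          # append prefix to emulate wrap-around
        return [wrapped[i:i + s] for i in range(k)]
    return [kmer[i:i + s] for i in range(k - s + 1)]

def smallest_position(kmer, s, circular=False):
    """1-based position of the lexicographically smallest s-mer (first if tied)."""
    subs = smers(kmer, s, circular)
    return subs.index(min(subs)) + 1

def is_open_syncmer(kmer, s, circular=False):
    """Open syncmer without offset: the smallest s-mer is the first one."""
    return smallest_position(kmer, s, circular) == 1

print(smers("ACGTA", 2), is_open_syncmer("ACGTA", 2))
print(smers("ACGTA", 2, circular=True), is_open_syncmer("ACGTA", 2, circular=True))

# Counter-example to the window guarantee: no 4-mer of ACCACC... is selected.
repeat = "ACC" * 5
selected = [i for i in range(len(repeat) - 3)
            if is_open_syncmer(repeat[i:i + 4], 2, circular=True)]
print(selected)
```

The wrap-around is emulated by appending the first s − 1 letters to the k-mer, which yields exactly k s-mers.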

Offset parameter

Down-sampled syncmers

Definition

Compression factor

Spacing

Prefix submers

Definition

Spacing

Speed optimization of submer identification

Submer properties on random sequences

Submer conservation with random mutations

Evaluation on whole-genome alignment

Genome pairs

Representative parameters

To select parameters representative of those used in practice, I chose the minimizer parameters (k = 15, w = 10) and (k = 31, w = 15) used by minimap2 (Li, 2018) and Kraken v1 (Wood & Salzberg, 2014), respectively. Comparable syncmer parameters were identified as those giving compression factors equal to or better than those of the minimizers, with equal or better seed conservation at 90% identity (Cons90).

Results

Comparison of syncmers with minimap2-like minimizers

Comparison of syncmers with Kraken v1-like minimizers

Table 4 compares Kraken v1-like minimizers with selected k = 31 syncmers that achieve better compression and/or better conservation. For example, open syncmers with s = 21, t = 5 achieve compression of 11.0 vs. 8.5 for minimizers (29% higher compression, i.e., about 23% lower density) with conservation of 0.081 vs. 0.077 (5% better).

Comparison of distance distributions

Figure 2 shows distance distributions for selected k = 8 submers to illustrate how the distribution changes with submer type and parameters. As noted in the “Introduction”, an ideal distribution would have modal frequency 1.0, but this is not possible in practice. A desirable feature is an upper bound w so that all distances >w have frequency zero; this is equivalent to the window guarantee. Also desirable is that short distances have low frequencies, because these correspond to submers with long overlaps, which are more likely to be deleted under mutations. With minimizers, all distances have approximately equal frequencies (see “Methods”), and short distances are therefore not suppressed. Open syncmers with offset t > 1 strongly suppress long distances and eliminate short distances, as expected (see “Methods”). Compare the second panel in row (A) with the second panel in row (G): minimizers with w = 8 have compression 4.5 and conservation 0.47, while open syncmers with s = 3 and offset t = 2 have the same conservation with substantially better compression (5.9). This can be understood as a consequence of the more desirable distance distribution of the syncmers in addition to their context-independence.
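The effect of the offset on short distances can be illustrated with a small simulation. This is a sketch under stated assumptions rather than a reproduction of Figure 2: the coding function is a random permutation of the 64 possible 3-mers (a stand-in for whatever coding the paper uses), ties are broken by first occurrence, the sequence is uniformly random, and all function names are hypothetical.

```python
import random
from collections import Counter
from itertools import product

def open_syncmer_positions(seq, k, s, t, code):
    """Start positions of k-mers whose smallest s-mer is at 1-based offset t."""
    positions = []
    for i in range(len(seq) - k + 1):
        codes = [code[seq[j:j + s]] for j in range(i, i + k - s + 1)]
        if codes.index(min(codes)) == t - 1:   # first occurrence breaks ties
            positions.append(i)
    return positions

def distance_distribution(positions):
    """Relative frequency of each distance between consecutive submers."""
    gaps = Counter(b - a for a, b in zip(positions, positions[1:]))
    total = sum(gaps.values())
    return {d: n / total for d, n in sorted(gaps.items())}

rng = random.Random(1)                          # fixed seed for repeatability
seq = "".join(rng.choice("ACGT") for _ in range(20000))
smer_list = ["".join(p) for p in product("ACGT", repeat=3)]
rng.shuffle(smer_list)                          # assumed random coding function
code = {smer: rank for rank, smer in enumerate(smer_list)}

dist_t1 = distance_distribution(open_syncmer_positions(seq, 8, 3, 1, code))
dist_t2 = distance_distribution(open_syncmer_positions(seq, 8, 3, 2, code))
for name, dist in (("t = 1", dist_t1), ("t = 2", dist_t2)):
    print(name, {d: round(f, 3) for d, f in dist.items() if d <= 4})
```

With t = 2, distance 1 cannot occur under this tie-breaking rule: if a k-mer's smallest s-mer sits at offset 2, that same s-mer is the first s-mer of the next k-mer and is no larger than the one at that k-mer's offset 2.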

Whole-genome alignment

Maximum distance under mutation

Parameter sweep

Discussion

Submer rules

Context-free submers do not have edge bias

Choosing submer parameters

The window guarantee is a weak heuristic

Density is not the appropriate optimization metric

Well-conserved submers

Connections between syncmers, UHSs and minimizers

Conclusions

Supplemental Information

Parameter sweep over submer properties.

Tabbed text file (gzip-compressed) with one line per submer which includes parameters such as k and w or s together with its compression factor, conservation, and distance distribution.

DOI: 10.7717/peerj.10805/supp-1

Additional Information and Declarations

Competing Interests

The author declares that he has no competing interests.

Author Contributions

Robert Edgar conceived and designed the experiments, performed the experiments, analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.

Data Availability

The following information was supplied regarding data availability:

Source code is available at GitHub: https://github.com/rcedgar/syncmer.

Funding

This work was funded by the author.
