Sparc: a sparsity-based consensus algorithm for long erroneous sequencing reads

PeerJ


Introduction

Methods

Building the initial graph

Aligning sequences to the backbone and building the whole graph

Sequences that align to the backbone sequence provide rich information about the ground-truth sequence. Ideally, the consensus would be the most likely genome sequence given all the input sequences. However, using the multi-sequence information comprehensively requires computationally expensive operations such as pairwise alignment of all the related sequences (Edgar, 2004; Larkin et al., 2007; Lee, Grasso & Sharlow, 2002; Rausch et al., 2009). Here we adopt a strategy similar to, but simpler than, that of PBdagcon (Chin et al., 2013): we align all the sequences to the backbone and modify the existing graph according to the alignments. Rather than creating an intermediate graph that needs to be refined or simplified (Chin et al., 2013; Rausch et al., 2009), we construct the final graph on the fly.
We borrow two ideas from the construction of a de Bruijn/k-mer graph (Pevzner, Tang & Waterman, 2001; Ye et al., 2012). (i) If a query region suggests a novel path/variant, we create a branch and allocate new k-mer nodes and the edges between these nodes. An example can be found in the upper half of Fig. 3B, where the last six bases of Seq1 are aligned to the existing graph: two new edges, ACC and AAA, each with multiplicity 1, and one new k-mer node, CC, are allocated. (ii) If a query region aligns perfectly to an existing region in the graph, we increase the edge weights in that region without allocating new nodes. Examples can also be found in Fig. 3B. When the first five bases of Seq1 are aligned to the existing graph, nodes AC and GG and edge TGG are merged implicitly with the ones created by the original backbone, and the edge weights are increased by 1. When the last six bases of Seq2 are aligned to the existing graph, the nodes and edges are merged implicitly with the ones created by Seq1, and the edge weights are updated accordingly.
As mentioned above, this construction process resembles that of a de Bruijn graph, but the nodes in our graph are distinguished by both their k-mers and their positions in the backbone. In addition, Sparc is designed to facilitate hybrid assembly and can give more weight to high-quality data: when different types of sequencing data are available, higher weights can be assigned to the more reliable edges. The resulting k-mer graph contains rich information about the underlying genomic region. Next, we describe a simple technique for extracting the most likely sequence as the consensus output.
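
The merge-or-branch rule and the position-aware nodes described above can be illustrated with a minimal Python sketch. This is not the Sparc implementation. It assumes each read is already aligned to the backbone without insertions or deletions and starts at backbone position 0, so a read base at index i sits at backbone position i; real alignments contain indels, which this sketch does not handle. The function name `add_sequence`, the parameters `k`, `g` and `weight`, and the toy sequences are all hypothetical choices made for illustration. They are not taken from Fig. 3B, although the values k = 2 and g = 3 produce 2-mer nodes and 3-mer edges of the same shape as those named in the text.

```python
from collections import defaultdict

def add_sequence(graph, seq, k=2, g=3, weight=1):
    """Thread one gap-free, backbone-aligned sequence into the graph.

    graph maps (source_node, edge_label, target_node) -> accumulated weight.
    A node is a (backbone_position, kmer) pair, so identical k-mers at
    different positions stay distinct.
    Assumption: seq starts at backbone position 0 and has no indels.
    """
    prev = (0, seq[:k])
    # Sample one k-mer node every g bases (a sparse k-mer graph).
    for i in range(g, len(seq) - k + 1, g):
        node = (i, seq[i:i + k])
        # The edge label is the g bases that end with the next k-mer.
        label = seq[i + k - g:i + k]
        # An existing key means the region merges and only its weight grows;
        # a new key means a branch with a new edge (and possibly a new node).
        graph[(prev, label, node)] += weight
        prev = node
    return graph

# Toy data (hypothetical).
backbone = "ACTGGTAATAA"
read1 = "ACTGGACCAAA"   # agrees with the backbone at first, then diverges
read2 = "ACTGGACCAAA"   # same variant, treated as a more reliable read

graph = defaultdict(int)
add_sequence(graph, backbone)
add_sequence(graph, read1)
add_sequence(graph, read2, weight=2)  # higher weight for higher-quality data

# Collect the distinct nodes referenced by the edges.
nodes = {n for (src, _, dst) in graph for n in (src, dst)}

for (src, label, dst), w in sorted(graph.items()):
    print(src, label, dst, w)
```

In this toy run, the shared prefix of the three sequences reuses the edge TGG, whose weight accumulates, while the variant region allocates the edges ACC and AAA and the node CC at position 6. The k-mer AA appears at positions 6 and 9 as two separate nodes, which is the difference from an ordinary de Bruijn graph noted in the text.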

Adjusting the weights of the graph

Output the heaviest path as the consensus

Implementation details

Results

Conclusion

Additional Information and Declarations

Competing Interests

The authors declare there are no competing interests.

Author Contributions

Chengxi Ye conceived and designed the experiments, performed the experiments, analyzed the data, wrote the paper, prepared figures and/or tables, and reviewed drafts of the paper.

Zhanshan (Sam) Ma conceived and designed the experiments, wrote the paper, and reviewed drafts of the paper.

Data Availability

The following information was supplied regarding data availability:

The research in this article did not generate any raw data.

Funding

The research received funding from the following sources: NSFC (Grant No: 61175071 & 71473243) and “Exceptional Scientists Program of Yunnan Province, China.” The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
