This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ PrePrints) and either DOI or URL of the article must be cited.
Multiple sequence alignment (MSA) is widely used to reveal structural and functional changes leading to genetic differences among species, and to reconstruct evolutionary histories of related genes, proteins and genomes. Traditionally, proteins and their coding sequences (CDSs) are aligned and analyzed separately, but often drastically different conclusions were drawn on a same set of data. Here we present a new alignment strategy, Codon and Amino Acid Unified Sequence Alignment (CAUSA) 2.0, which aligns proteins and their coding sequences simultaneously. CAUSA 2.0 optimizes the alignment of CDSs at both codon and amino acid level efficiently. Theoretical analysis showed that CAUSA 2.0 enhances the entropy information content of MSA. Empirical data analysis demonstrated that CAUSA 2.0 is more accurate and consistent than nucleotide, protein or codon level alignments. CAUSA 2.0 locates in-frame indels more accurately, makes the alignment of coding sequences biologically more significant, and reveals several novel mutation mechanisms that relate to some genetic diseases. CAUSA 2.0 is available in website www.DNAPlusPro.com .
This is a submission to PeerJ Computer Science for review.