CAM: An alignment-free method to recover phylogenies using codon aversion motifs

Department of Biology, Brigham Young University, Provo, Utah, United States
Brigham Young University, M.L. Bean Museum, Provo, Utah, United States
DOI
10.7287/peerj.preprints.27756v1
Subject Areas
Bioinformatics, Evolutionary Studies, Taxonomy
Keywords
phylogeny, codon usage bias, alignment-free, codon aversion, tree of life, taxonomy, maximum likelihood, phylogenomics, phylogenetics, systematics
Copyright
© 2019 Miller et al.
Licence
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
Cite this article
Miller JB, McKinnon LM, Whiting MF, Ridge PG. 2019. CAM: An alignment-free method to recover phylogenies using codon aversion motifs. PeerJ Preprints 7:e27756v1

Abstract

Background. Common phylogenomic approaches for recovering phylogenies are often time-consuming and require annotations for orthologous gene relationships that are not always available. In contrast, alignment-free phylogenomic approaches typically use structure and oligomer frequencies to calculate pairwise distances between species. We have developed an algorithm to quickly calculate distances between species based on codon aversion.

Methods. We present CAM, an alignment-free approach that recovers phylogenies by comparing codon aversion motifs (i.e., the set of unused codons within each gene), a novel alignment-free character state, across all genes within a species. Synonymous codon usage is non-random and differs between organisms, between genes, and even within a single gene, and many genes do not use all possible codons. We report a comprehensive analysis of codon aversion within 229 742 339 genes from 23 428 species across all kingdoms of life, and we provide an alignment-free framework for its use in a phylogenetic construct. For each species, we first construct a set of codon aversion motifs spanning all genes within that species. We define the pairwise distance between two species, A and B, as one minus the number of codon aversion motifs shared by A and B divided by the total number of motifs in whichever of the two species contains fewer motifs. This approach allows us to calculate pairwise distances even when the species differ substantially in their number of genes or are highly divergent. Finally, we use neighbor-joining to recover phylogenies.
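The distance described above can be illustrated with a minimal Python sketch. This is not the CAM implementation from the GitHub repository; the function names (`aversion_motif`, `species_motifs`, `cam_distance`), the representation of a motif as a frozen set of codon strings, and the handling of empty inputs are all assumptions made for illustration. The neighbor-joining step is not shown.

```python
from itertools import product

# All 64 DNA codons (assumption: sequences use the A/C/G/T alphabet).
ALL_CODONS = frozenset("".join(p) for p in product("ACGT", repeat=3))


def aversion_motif(cds):
    """Return the set of codons that never occur in one coding sequence."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    used = {seq[i:i + 3] for i in range(0, usable, 3)}
    return ALL_CODONS - used  # a frozenset, so it can be stored in a set


def species_motifs(genes):
    """Collect the distinct codon aversion motifs over all genes of a species."""
    return {aversion_motif(gene) for gene in genes}


def cam_distance(motifs_a, motifs_b):
    """One minus shared motifs divided by the motif count of the smaller set."""
    smaller = min(len(motifs_a), len(motifs_b))
    if smaller == 0:
        # Hypothetical choice: the abstract does not say how empty sets are handled.
        raise ValueError("each species needs at least one motif")
    shared = len(motifs_a & motifs_b)
    return 1 - shared / smaller


# Toy sequences, invented purely for illustration.
species_a = species_motifs(["ATGAAATAA", "ATGCCCTAA"])
species_b = species_motifs(["ATGAAATAA", "ATGGGGTAG", "ATGTTTTGA"])
distance_ab = cam_distance(species_a, species_b)
```

In this toy example the two species share one motif and the smaller species has two motifs, so `distance_ab` is 0.5. Dividing by the smaller motif count rather than the union means a species with few genes is not penalized for motifs it could never have contributed.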

Results. Using the Open Tree of Life and NCBI Taxonomy Database as expected phylogenies, our approach compares well, recovering phylogenies that largely match expected trees and are comparable to trees recovered using maximum likelihood and other alignment-free approaches. Our technique is much faster than maximum likelihood and similar in accuracy to other alignment-free approaches. Therefore, we propose that codon aversion be considered a phylogenetically conserved character that may be used in future phylogenomic studies.

Availability. CAM, documentation, and test files are freely available on GitHub at https://github.com/ridgelab/cam

Author Comment

This is a submission to PeerJ for review.

Supplemental Information

Supplemental Note 1: Summary of CAM Options

A short summary of available parameters to modify the output from CAM.

DOI: 10.7287/peerj.preprints.27756v1/supp-1

The advantages and disadvantages of each phylogenetic comparison metric

The first column is the name of the metric. The second column is a short description of how the metric works. The third and fourth columns explain the advantages and disadvantages of each metric, respectively.

DOI: 10.7287/peerj.preprints.27756v1/supp-2

Frequency of Partial Genes

This figure shows the proportion of partial genes in each clade. A partial gene is defined as a gene for which the entire DNA sequence is not available. Each boxplot represents the distribution of the proportion of partial genes in each species of the clade.

DOI: 10.7287/peerj.preprints.27756v1/supp-3

All Clades - Motifs Found in Multiple Species vs. Unique Motifs

Shows how many motifs are shared in different genes within the same clade versus how many motifs are unique to a single gene.

DOI: 10.7287/peerj.preprints.27756v1/supp-4

Archaea - Motifs Found in Multiple Species vs. Unique Motifs

Shows how many motifs are shared in different genes within the same clade versus how many motifs are unique to a single gene.

DOI: 10.7287/peerj.preprints.27756v1/supp-5

Bacteria - Motifs Found in Multiple Species vs. Unique Motifs

Shows how many motifs are shared in different genes within the same clade versus how many motifs are unique to a single gene.

DOI: 10.7287/peerj.preprints.27756v1/supp-6

Fungi - Motifs Found in Multiple Species vs. Unique Motifs

Shows how many motifs are shared in different genes within the same clade versus how many motifs are unique to a single gene.

DOI: 10.7287/peerj.preprints.27756v1/supp-7

Invertebrates - Motifs Found in Multiple Species vs. Unique Motifs

Shows how many motifs are shared in different genes within the same clade versus how many motifs are unique to a single gene.

DOI: 10.7287/peerj.preprints.27756v1/supp-8

Mammals - Motifs Found in Multiple Species vs. Unique Motifs

Shows how many motifs are shared in different genes within the same clade versus how many motifs are unique to a single gene.

DOI: 10.7287/peerj.preprints.27756v1/supp-9

Plants - Motifs Found in Multiple Species vs. Unique Motifs

Shows how many motifs are shared in different genes within the same clade versus how many motifs are unique to a single gene.

DOI: 10.7287/peerj.preprints.27756v1/supp-10

Protozoa - Motifs Found in Multiple Species vs. Unique Motifs

Shows how many motifs are shared in different genes within the same clade versus how many motifs are unique to a single gene.

DOI: 10.7287/peerj.preprints.27756v1/supp-11

Vertebrate Other - Motifs Found in Multiple Species vs. Unique Motifs

Shows how many motifs are shared in different genes within the same clade versus how many motifs are unique to a single gene.

DOI: 10.7287/peerj.preprints.27756v1/supp-12

Viruses - Motifs Found in Multiple Species vs. Unique Motifs

Shows how many motifs are shared in different genes within the same clade versus how many motifs are unique to a single gene.

DOI: 10.7287/peerj.preprints.27756v1/supp-13

All Clades - Frequency of Codon Aversion by Codon

The frequency of codon exclusion for the taxonomic group. For each codon, the box plot shows the distribution, across the species in the taxonomic group, of the proportion of each species' genes that exclude that codon (e.g., if a codon is not used in 50% of a species' genes, then that species would be plotted at 0.50).

DOI: 10.7287/peerj.preprints.27756v1/supp-14

Archaea - Frequency of Codon Aversion by Codon

The frequency of codon exclusion for the taxonomic group. For each codon, the box plot shows the distribution, across the species in the taxonomic group, of the proportion of each species' genes that exclude that codon (e.g., if a codon is not used in 50% of a species' genes, then that species would be plotted at 0.50).

DOI: 10.7287/peerj.preprints.27756v1/supp-15

Bacteria - Frequency of Codon Aversion by Codon

The frequency of codon exclusion for the taxonomic group. For each codon, the box plot shows the distribution, across the species in the taxonomic group, of the proportion of each species' genes that exclude that codon (e.g., if a codon is not used in 50% of a species' genes, then that species would be plotted at 0.50).

DOI: 10.7287/peerj.preprints.27756v1/supp-16

Fungi - Frequency of Codon Aversion by Codon

The frequency of codon exclusion for the taxonomic group. For each codon, the box plot shows the distribution, across the species in the taxonomic group, of the proportion of each species' genes that exclude that codon (e.g., if a codon is not used in 50% of a species' genes, then that species would be plotted at 0.50).

DOI: 10.7287/peerj.preprints.27756v1/supp-17

Invertebrates - Frequency of Codon Aversion by Codon

The frequency of codon exclusion for the taxonomic group. For each codon, the box plot shows the distribution, across the species in the taxonomic group, of the proportion of each species' genes that exclude that codon (e.g., if a codon is not used in 50% of a species' genes, then that species would be plotted at 0.50).

DOI: 10.7287/peerj.preprints.27756v1/supp-18

Mammals - Frequency of Codon Aversion by Codon

The frequency of codon exclusion for the taxonomic group. For each codon, the box plot shows the distribution, across the species in the taxonomic group, of the proportion of each species' genes that exclude that codon (e.g., if a codon is not used in 50% of a species' genes, then that species would be plotted at 0.50).

DOI: 10.7287/peerj.preprints.27756v1/supp-19

Plants - Frequency of Codon Aversion by Codon

The frequency of codon exclusion for the taxonomic group. For each codon, the box plot shows the distribution, across the species in the taxonomic group, of the proportion of each species' genes that exclude that codon (e.g., if a codon is not used in 50% of a species' genes, then that species would be plotted at 0.50).

DOI: 10.7287/peerj.preprints.27756v1/supp-20

Protozoa - Frequency of Codon Aversion by Codon

The frequency of codon exclusion for the taxonomic group. For each codon, the box plot shows the distribution, across the species in the taxonomic group, of the proportion of each species' genes that exclude that codon (e.g., if a codon is not used in 50% of a species' genes, then that species would be plotted at 0.50).

DOI: 10.7287/peerj.preprints.27756v1/supp-21

Vertebrate Other - Frequency of Codon Aversion by Codon

The frequency of codon exclusion for the taxonomic group. For each codon, the box plot shows the distribution, across the species in the taxonomic group, of the proportion of each species' genes that exclude that codon (e.g., if a codon is not used in 50% of a species' genes, then that species would be plotted at 0.50).

DOI: 10.7287/peerj.preprints.27756v1/supp-22

Viruses - Frequency of Codon Aversion by Codon

The frequency of codon exclusion for the taxonomic group. For each codon, the box plot shows the distribution, across the species in the taxonomic group, of the proportion of each species' genes that exclude that codon (e.g., if a codon is not used in 50% of a species' genes, then that species would be plotted at 0.50).

DOI: 10.7287/peerj.preprints.27756v1/supp-23

All Clades - Number of Codons Excluded in Motifs

Shows the distribution of the number of codons (0-64) that are not used in each gene.

DOI: 10.7287/peerj.preprints.27756v1/supp-24

Archaea - Number of Codons Excluded in Motifs

Shows the distribution of the number of codons (0-64) that are not used in each gene.

DOI: 10.7287/peerj.preprints.27756v1/supp-25

Bacteria - Number of Codons Excluded in Motifs

Shows the distribution of the number of codons (0-64) that are not used in each gene.

DOI: 10.7287/peerj.preprints.27756v1/supp-26

Fungi - Number of Codons Excluded in Motifs

Shows the distribution of the number of codons (0-64) that are not used in each gene.

DOI: 10.7287/peerj.preprints.27756v1/supp-27

Invertebrates - Number of Codons Excluded in Motifs

Shows the distribution of the number of codons (0-64) that are not used in each gene.

DOI: 10.7287/peerj.preprints.27756v1/supp-28

Mammals - Number of Codons Excluded in Motifs

Shows the distribution of the number of codons (0-64) that are not used in each gene.

DOI: 10.7287/peerj.preprints.27756v1/supp-29

Protozoa - Number of Codons Excluded in Motifs

Shows the distribution of the number of codons (0-64) that are not used in each gene.

DOI: 10.7287/peerj.preprints.27756v1/supp-30

Plants - Number of Codons Excluded in Motifs

Shows the distribution of the number of codons (0-64) that are not used in each gene.

DOI: 10.7287/peerj.preprints.27756v1/supp-31

Vertebrate Other - Number of Codons Excluded in Motifs

Shows the distribution of the number of codons (0-64) that are not used in each gene.

DOI: 10.7287/peerj.preprints.27756v1/supp-32

Viruses - Number of Codons Excluded in Motifs

Shows the distribution of the number of codons (0-64) that are not used in each gene.

DOI: 10.7287/peerj.preprints.27756v1/supp-33

All Clades - Repeated Motifs

The frequency with which codon motifs are repeated is shown. The x-axis depicts how many times a motif was repeated in all the genes in a clade. The y-axis depicts how many motifs were repeated a given number of times (shown in the natural log). Some outliers were removed from each graph for clarity. These outliers represent the motifs in which only stop codons are excluded. All clades outliers excluded: (1309911,1), (2185083,1), (2433089,1).

DOI: 10.7287/peerj.preprints.27756v1/supp-34

Archaea - Repeated Motifs

The frequency with which codon motifs are repeated is shown. The x-axis depicts how many times a motif was repeated in all the genes in a clade. The y-axis depicts how many motifs were repeated a given number of times (shown in the natural log). Some outliers were removed from each graph for clarity. These outliers represent the motifs in which only stop codons are excluded. Archaea outliers excluded: (10360,1), (10564,1).

DOI: 10.7287/peerj.preprints.27756v1/supp-35

Bacteria - Repeated Motifs

The frequency with which codon motifs are repeated is shown. The x-axis depicts how many times a motif was repeated in all the genes in a clade. The y-axis depicts how many motifs were repeated a given number of times (shown in the natural log). Some outliers were removed from each graph for clarity. These outliers represent the motifs in which only stop codons are excluded. Bacteria outliers excluded: (681998,1), (1085854,1), (1611727,1).

DOI: 10.7287/peerj.preprints.27756v1/supp-36

Fungi - Repeated Motifs

The frequency with which codon motifs are repeated is shown. The x-axis depicts how many times a motif was repeated in all the genes in a clade. The y-axis depicts how many motifs were repeated a given number of times (shown in the natural log). Some outliers were removed from each graph for clarity. These outliers represent the motifs in which only stop codons are excluded. Fungi outliers excluded: (140907,0), (157884,1), (226451,1).

DOI: 10.7287/peerj.preprints.27756v1/supp-37

Invertebrates - Repeated Motifs

The frequency with which codon motifs are repeated is shown. The x-axis depicts how many times a motif was repeated in all the genes in a clade. The y-axis depicts how many motifs were repeated a given number of times (shown in the natural log). Some outliers were removed from each graph for clarity. These outliers represent the motifs in which only stop codons are excluded. Invertebrates outliers excluded: (110662,1), (201864,1), (166597,1).

DOI: 10.7287/peerj.preprints.27756v1/supp-38

Mammals - Repeated Motifs

The frequency with which codon motifs are repeated is shown. The x-axis depicts how many times a motif was repeated in all the genes in a clade. The y-axis depicts how many motifs were repeated a given number of times (shown in the natural log). Some outliers were removed from each graph for clarity. These outliers represent the motifs in which only stop codons are excluded. Mammal outliers excluded: (81051,1), (105156,1), (17812,1).

DOI: 10.7287/peerj.preprints.27756v1/supp-39

Plants - Repeated Motifs

The frequency with which codon motifs are repeated is shown. The x-axis depicts how many times a motif was repeated in all the genes in a clade. The y-axis depicts how many motifs were repeated a given number of times (shown in the natural log). Some outliers were removed from each graph for clarity. These outliers represent the motifs in which only stop codons are excluded. Plants outliers excluded: (158430,1), (127795,1), (224688,1).

DOI: 10.7287/peerj.preprints.27756v1/supp-40

Protozoa - Repeated Motifs

The frequency with which codon motifs are repeated is shown. The x-axis depicts how many times a motif was repeated in all the genes in a clade. The y-axis depicts how many motifs were repeated a given number of times (shown in the natural log). Some outliers were removed from each graph for clarity. These outliers represent the motifs in which only stop codons are excluded. Protozoa outliers excluded: (32139,1), (41048,1), (30539,1).

DOI: 10.7287/peerj.preprints.27756v1/supp-41

Vertebrate Other - Repeated Motifs

The frequency with which codon motifs are repeated is shown. The x-axis depicts how many times a motif was repeated in all the genes in a clade. The y-axis depicts how many motifs were repeated a given number of times (shown in the natural log). Some outliers were removed from each graph for clarity. These outliers represent the motifs in which only stop codons are excluded. Vertebrate other outliers excluded: (167892,1), (114746,1), (254804,1).

DOI: 10.7287/peerj.preprints.27756v1/supp-42

Viruses - Repeated Motifs

The frequency with which codon motifs are repeated is shown. The x-axis depicts how many times a motif was repeated in all the genes in a clade. The y-axis depicts how many motifs were repeated a given number of times (shown in the natural log). Some outliers were removed from each graph for clarity. These outliers represent the motifs in which only stop codons are excluded. Virus outliers excluded: (2669,1), (2167,1), (4664,1).

DOI: 10.7287/peerj.preprints.27756v1/supp-43