In polyploid genomes, homoeologs are a specific subtype of homologs, and can be thought of as orthologs between subgenomes. In Orthologous MAtrix, we infer homoeologs in three polyploid plant species: upland cotton (

Polyploidy is an important and widespread phenomenon within the plant kingdom (

Term | Definition |
---|---|
Polyploidy | Having more than two sets of homologous chromosomes; the result of genome doubling. |
PAM units | Point accepted mutation. A measure of evolutionary distance; the number of amino acid substitutions per 100 amino acids of a protein sequence. One PAM unit means that 1% of the amino acids were replaced since the divergence of the two protein sequences. |
Homoeolog | Genes of an allopolyploid which started diverging by a speciation event, and were brought back to the same genome via a hybridization event. They can be thought of as orthologs between subgenomes. |
Allopolyploid | A species which has more than one set of homologous chromosomes due to a whole genome duplication via hybridization. |
Subgenome | One of the genome sets in a polyploid. |
Synteny | The degree of gene position conservation between two diverging segments of chromosomes, in this case between two homoeologous chromosomes. |
Evolutionary distance | The amount of divergence between two protein sequences. |
Total copy number | An assessment of the amount of duplication for a given homoeolog pair. In this paper, it is the sum of the homoeologs for both genes of the pair. |

Since 2015 we have included pairwise homoeolog predictions between subgenomes in Orthologous MAtrix (OMA), which is a method and database for inferring evolutionary relationships (

However, relaxing one-to-one and synteny criteria makes it harder to distinguish correct from incorrect calls. Furthermore, because of the redundancy and size, polyploid genomes can be difficult to assemble and annotate. For instance,

Several methods have been reported that yield quantitative confidence scores for ortholog or paralog predictions. For example, InParanoid assigns confidence scores to in-paralogs on a scale from 0 to 100 depending on how distant the predicted in-paralog sequence is from the “main” ortholog. Additionally, InParanoid confidence scores are assigned to orthologous groups based on a technique that assigns a higher score to potential ortholog sequences that have a much better bootstrap value than competing ortholog sequences (

To our knowledge, there have not been any quantitative confidence scores of homoeolog predictions reported. However, there has been qualitative confidence reported for some polyploids. In a paper by

Here, we introduce a more fine-grained and flexible confidence score for homoeolog predictions. Based on fuzzy logic, it combines evolutionary distance, local synteny, and the extent of duplication. Fuzzy logic is about “degrees of truth” rather than binary true or false and is based on the idea that how true or not something is can be represented over a continuum. (See

Term | Definition |
---|---|
Fuzzy logic | A type of mathematical logic based on natural language where truth is considered on a continuous scale as degrees of truth rather than binary true or false. Fuzzy logic resembles human reasoning and intuition because it uses classes with unsharp boundaries, defined with natural language. |
Control system | The mathematical models which make up the fuzzy inference process. |
Universe of discourse | A set of all possible values defined for a fuzzy input or output. |
Membership function | A function, normally visualized graphically, which denotes a fuzzy set. The membership function represents the degree (between 0 and 1) to which an element in the universe of discourse belongs to that set. Membership functions also represent linguistic variables, and they overlap so that an input may belong to two categories, each to a certain degree. |
Fuzzification | The process of translating a crisp input to a fuzzy one. This is the first step of the fuzzy inference process, where the crisp input gets mapped to its fuzzy set based on the membership functions. |
Defuzzification | The process of converting the fuzzy output derived from the fuzzy inference process to a crisp output. |
Fuzzy rules | A set of “if … then” rules needed for mapping the fuzzy input to the fuzzy output. These rules are based on a human’s expertise, knowledge, and intuition. The fuzzy rules are defined and stored in the lookup table. |
Crisp input or output | Input or output which has a quantitative value, limited to the range of the universe of discourse. |
Fuzzy inference process | The fuzzy inference process consists of taking the crisp input, fuzzifying it, combining it with the fuzzy rules, and defuzzifying, resulting in a crisp output. |
Fuzzy set | A set with unsharp boundaries, as defined by the membership function. A fuzzy set allows its members to belong to more than one set at the same time, each to some partial degree. |

In the latest OMA release, we include three agriculturally important allopolyploid crops:

Upland cotton (_{t}A_{t}D_{t}D_{t}; 2

Oilseed rape (

Wheat (

In OMA, we inferred homoeologs in the polyploid species by treating each subgenome as a separate genome and inferring orthologs following the normal OMA pipeline (

Recently, several improvements to the OMA algorithm were introduced (

All data, including the genome information and homoeologous/orthologous relationships, is stored in an HDF5 database (

Three inputs, that is, features of a given homoeolog pair, were used for the fuzzy logic variables: the evolutionary distance, synteny score, and total copy number.

Synteny is the overall conservation of chromosome order and location of genes when comparing two chromosomes. However, rearrangements may result in smaller regions of the chromosome being syntenic, rather than the whole chromosome. Thus, we computed a local synteny score for each pair of homoeologs (

For each pair, a window of 10 genes surrounding each homoeolog was obtained. The synteny score is the mean proportion of genes in the windows that are homoeologous. (A) An example of a pair with a high synteny score and (B) example with a low synteny score.
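The window-based score described above can be illustrated with a minimal sketch. It is not the OMA implementation: the function name, the data layout (ordered gene lists per chromosome or scaffold, homoeolog pairs as tuples), and the choice of five flanking genes on each side to make up the window of 10 are assumptions made for illustration only.

```python
def local_synteny_score(gene_a, gene_b, order_a, order_b, homoeolog_pairs, flank=5):
    """Mean proportion of neighbouring genes that have a homoeolog in the partner window.

    order_a / order_b: genes in positional order on the chromosome (or scaffold)
    carrying gene_a / gene_b. flank=5 on each side is an assumed reading of
    "a window of 10 genes".
    """
    def window(gene, order):
        i = order.index(gene)
        # Neighbours within `flank` positions, excluding the focal gene itself.
        return [g for g in order[max(0, i - flank): i + flank + 1] if g != gene]

    win_a, win_b = window(gene_a, order_a), window(gene_b, order_b)
    if not win_a or not win_b:
        return 0.0  # no neighbouring genes: score cannot be computed, use zero

    pairs = set(homoeolog_pairs)
    hits_a = sum(any((x, y) in pairs for y in win_b) for x in win_a)
    hits_b = sum(any((x, y) in pairs for x in win_a) for y in win_b)
    return (hits_a / len(win_a) + hits_b / len(win_b)) / 2


# Toy example with hypothetical gene identifiers.
order_a = ["a1", "a2", "a3", "a4", "a5"]
order_b = ["b1", "b2", "b3", "b4", "b5"]
pairs = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")]
score = local_synteny_score("a3", "b3", order_a, order_b, pairs, flank=2)
```

In the toy example, three of the four neighbours in each window have a homoeolog in the partner window, so the score is 0.75.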

The evolutionary distance is based on the number of amino acid substitutions between two protein sequences. It is expressed in PAM units and calculated as part of the normal OMA algorithm (

The “total copy number” is a metric to understand the degree of duplication for a pair of homoeologs. For a given pair, it is calculated as the number of homoeologs for the first gene + the number of homoeologs for the second gene.
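As a small illustration of this definition, the sketch below counts homoeologs per gene and sums the two counts for each pair. The function name and the input layout (a list of unique gene-pair tuples) are hypothetical and not taken from the OMA code base.

```python
from collections import Counter


def total_copy_numbers(homoeolog_pairs):
    """Return {pair: total copy number} for a list of unique homoeolog pairs."""
    per_gene = Counter()
    for gene_a, gene_b in homoeolog_pairs:
        per_gene[gene_a] += 1  # number of homoeologs of the first gene
        per_gene[gene_b] += 1  # number of homoeologs of the second gene
    return {(a, b): per_gene[a] + per_gene[b] for a, b in homoeolog_pairs}


# Toy example: A2 has two homoeologs (D2 and D3), A1 has one.
pairs = [("A1", "D1"), ("A2", "D2"), ("A2", "D3")]
copy_nr = total_copy_numbers(pairs)
```

A strict one-to-one pair therefore has a total copy number of 2, which is the smallest possible value.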

For each genome, input variable (synteny, evolutionary distance, total copy nr), and output variable (confidence score), the universe of discourse is the range of possible values. The universe for each variable is defined in

Variable | Input or output | Minimum | Maximum | Step |
---|---|---|---|---|
Distance | Input | 0 | Distance max | 0.01 |
Synteny score | Input | 0 | 1 | 0.01 |
Total copy nr | Input | 2 | Total copy nr max | 1 |
Confidence | Output | 0 | 100 | 1 |

Variable | Membership class | Central point | Standard deviation |
---|---|---|---|
Distance | Low | 0 | Distance maximum/10 |
Distance | Med | Distance maximum/4 | Distance maximum/10 |
Distance | High | Distance maximum | Distance maximum/2.5 |
Synteny | Low | 0 | 0.15 |
Synteny | Med | 0.3 | 0.15 |
Synteny | High | 1 | 0.4 |
TotalCopyNr | Low | TotalCopyNr median | TotalCopyNr median |
TotalCopyNr | Med | 4 × TotalCopyNr median | 1.5 × TotalCopyNr median |
TotalCopyNr | High | TotalCopyNr maximum | TotalCopyNr maximum/2.5 |
Confidence | Very low | 0 | 20 |
Confidence | Low | 50 | 10 |
Confidence | Med | 70 | 10 |
Confidence | High | 90 | 10 |
Confidence | Very high | 100 | 10 |

Each membership class is a Gaussian curve, with the central point and standard deviation defined here.
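To make the fuzzification step concrete, the following sketch evaluates Gaussian membership curves using the central points and standard deviations listed in the table. It uses only the standard library; the helper names are hypothetical, and the curve is assumed to be the usual unnormalized Gaussian with a peak value of 1.

```python
import math


def gaussmf(x, centre, sigma):
    """Gaussian membership: 1 at the central point, decaying with distance from it."""
    return math.exp(-((x - centre) ** 2) / (2.0 * sigma ** 2))


# (central point, standard deviation) for the synteny classes, from the table.
SYNTENY_CLASSES = {"low": (0.0, 0.15), "med": (0.3, 0.15), "high": (1.0, 0.4)}


def distance_classes(distance_max):
    """Distance classes depend on the genome-specific maximum distance."""
    return {
        "low": (0.0, distance_max / 10),
        "med": (distance_max / 4, distance_max / 10),
        "high": (distance_max, distance_max / 2.5),
    }


def fuzzify(x, classes):
    """Map a crisp input to its degree of membership in each class."""
    return {name: gaussmf(x, c, s) for name, (c, s) in classes.items()}


memberships = fuzzify(0.3, SYNTENY_CLASSES)
```

A synteny score of 0.3 belongs fully to the "med" class while keeping a partial membership in "low" and "high", which is the overlap between categories that the membership curves are designed to capture.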

We created five rules based on the universes defined above and stored them in the lookup table (see Results). The control system and simulation were made using the skfuzzy control module and the rules. This simulation was then used to defuzzify, that is, return a crisp output, using the centroid defuzzification method. The simulation takes the three inputs and returns a confidence score between 0 and 100. We then kept the smallest confidence score returned as the minimum and scaled the maximum confidence score to be 100. A set of 30 homoeolog pairs was manually evaluated in
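The inference and centroid defuzzification steps can be sketched in plain Python as follows. This is an illustration and not the skfuzzy-based pipeline: the two rules in the example are hypothetical placeholders (the actual five rules are in the lookup table), and the operators used here (minimum for "and", clipping of the output curve, maximum for aggregation) are common Mamdani-style choices assumed for the sketch. Only the Confidence membership parameters and the 0 to 100 universe come from the tables above.

```python
import math


def gaussmf(x, centre, sigma):
    return math.exp(-((x - centre) ** 2) / (2.0 * sigma ** 2))


# Confidence membership classes (central point, standard deviation) from the table.
CONFIDENCE = {
    "very_low": (0, 20), "low": (50, 10), "med": (70, 10),
    "high": (90, 10), "very_high": (100, 10),
}
UNIVERSE = range(0, 101)  # 0..100 in steps of 1


def centroid_confidence(fired_rules):
    """fired_rules: list of (output class, firing strength in [0, 1])."""
    aggregated = []
    for x in UNIVERSE:
        # Clip each output curve at its rule's firing strength, then take the maximum.
        clipped = [min(strength, gaussmf(x, *CONFIDENCE[label]))
                   for label, strength in fired_rules]
        aggregated.append(max(clipped, default=0.0))
    area = sum(aggregated)
    if area == 0:
        raise ValueError("no rule fired")
    # Centroid: membership-weighted mean over the universe of discourse.
    return sum(x * m for x, m in zip(UNIVERSE, aggregated)) / area


# Hypothetical rules, for illustration only:
#   IF synteny is high AND distance is low  THEN confidence is very high
#   IF synteny is low  AND distance is high THEN confidence is very low
synteny_high, distance_low = 0.9, 0.8
synteny_low, distance_high = 0.05, 0.1
score = centroid_confidence([
    ("very_high", min(synteny_high, distance_low)),
    ("very_low", min(synteny_low, distance_high)),
])
```

Because the centroid is taken over the whole clipped output curve, the crisp score stays strictly inside the 0 to 100 range even when only the "very high" class fires.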

We inferred homoeologs for three polyploid species using OMA:

Rearrangements may result in smaller regions of the chromosome being syntenic, rather than the whole chromosome. In order to justify using a local synteny score rather than a global synteny based on chromosome matching between subgenomes, we computed the number of homoeologs across pairs of chromosomes between two subgenomes in each species. With OMA we predicted many homoeologs across different chromosome groups (

Homoeologs that were on scaffolds or “randoms” were mapped to their respective chromosomes. “Off-diagonal” chromosomes, that is, different chromosome groups, with an increased frequency of homoeologs are consistent with known reciprocal translocations.

Non-homoeologous chromosomes, that is, different chromosome groups, with increased frequency of homoeologs are consistent with known reciprocal translocations. For example, in GOSHI, there are two known large reciprocal translocations: parts of the chromosomes were exchanged between A02 and A03, as well as between A04 and A05. We would be able to see this by an increased frequency of homoeolog pairs inferred between chromosomes not belonging to the same chromosome group (

Chromosome segments were exchanged between A02 and A03, as well as A04 and A05 in subgenome A. This would result in an increased frequency of homoeolog pairs predicted between chromosomes A02/D03, A03/D02, A04/D05, and A05/D04, which we observe with the homoeolog pairs inferred by OMA. In this figure, chromosome segments of the same color between subgenome A (A) and subgenome D (B) are those with a high frequency of homoeolog pairs.

Additionally, chromosome pairs with only a few homoeologs may represent single-gene translocations between non-homoeologous chromosomes and should not be discarded from homoeology prediction via synteny-based methods. Taken together, these results suggest that a local synteny score is more robust than global synteny because it accounts for both large and small translocations.
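The chromosome-by-chromosome tally behind this comparison can be sketched as below. The function names and the gene-to-chromosome mapping are hypothetical; the chromosome labels follow the A02/D03 naming used for cotton above, and chromosome groups are assumed to be identifiable by the part of the label after the subgenome letter.

```python
from collections import Counter


def chromosome_pair_counts(homoeolog_pairs, chromosome_of):
    """Count homoeolog pairs for every (subgenome A chromosome, subgenome D chromosome)."""
    return Counter((chromosome_of[a], chromosome_of[b]) for a, b in homoeolog_pairs)


def off_diagonal(counts):
    """Keep only chromosome pairs that belong to different chromosome groups."""
    return {pair: n for pair, n in counts.items() if pair[0][1:] != pair[1][1:]}


# Toy data with hypothetical gene identifiers.
chromosome_of = {"a1": "A01", "d1": "D01", "a2": "A02", "d2": "D03",
                 "a3": "A02", "d3": "D03"}
pairs = [("a1", "d1"), ("a2", "d2"), ("a3", "d3")]
counts = chromosome_pair_counts(pairs, chromosome_of)
translocation_candidates = off_diagonal(counts)
```

Off-diagonal entries with high counts are candidates for large reciprocal translocations, while entries with very low counts may reflect single-gene translocations.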

For each homoeolog pair, we used the synteny score, the evolutionary distance, and the total copy number as input.

The synteny score is the degree of local gene neighborhood conservation. Although synteny is not a hard requirement for homoeologs, conservation of synteny is a good indicator of correct homoeolog predictions. In order to account for chromosomal rearrangements, as well as genome assemblies which are not yet fully assembled into pseudomolecules, we computed a local synteny score. This technique, however, only works when both genes of a homoeolog pair have at least one neighbor gene. Therefore, we could not compute the synteny score for homoeologs that were on small scaffolds with only one gene annotated. For those pairs we set the synteny score to zero. This affected 490, 0, and 36,250 pairs for GOSHI, BRANA, and WHEAT, respectively.

The evolutionary distance is based on the number of amino acid substitutions between two protein sequences (in PAM units). Because of the relatively short divergence time between subgenomes, we expect the distance between homoeologs to be generally low. Additionally, genes which have a high number of predicted homoeologs could indicate something suspect, such as a transposable element (TE) misannotated as a gene. All distributions of the inputs are shown in

(A–C) Distribution of synteny scores for pairs of homoeologs for

The membership functions allow us to translate an input value into a degree of membership between 0 and 1. The membership curves are overlapping to account for the fuzziness between categories, and for all genomes we defined what we consider to be low, medium, or high input values (

The output from our fuzzy inference process, Confidence, also has a membership function. It is used to map the fuzzy confidence to a crisp confidence score, between 0 and 100 (

(A) Lookup table. The first three columns are the inputs. The final column is the output (confidence), and reflects the same colors as the confidence membership curves in (B).

After defining the rules, we created a control system and simulation for each of the genomes. The inputs for each homoeolog pair were then fed into the simulation, which contains the rules, and defuzzified. The defuzzification process converts the fuzzy linguistic confidence to a crisp confidence score, which we then scaled between the minimum value and 100. Without this scaling, the scores would show a sharp cutoff at a maximum of around 80. Scaling facilitates comparison of homoeolog confidence scores within genomes, as people naturally tend to associate the best score with 100. The resulting distribution of confidence scores is shown in
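One way to carry out this rescaling is a linear stretch that leaves the smallest score unchanged and maps the largest score to 100. The text does not specify the exact transformation, so the linear form and the function name below are assumptions for illustration.

```python
def rescale_to_100(scores):
    """Linearly stretch scores so the minimum is unchanged and the maximum becomes 100."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores identical: nothing to stretch (arbitrary choice for this sketch).
        return list(scores)
    factor = (100.0 - lo) / (hi - lo)
    return [lo + (s - lo) * factor for s in scores]


# Toy example: raw defuzzified scores topping out at 80.
raw = [20.0, 50.0, 80.0]
scaled = rescale_to_100(raw)
```

In the toy example the scores 20, 50, and 80 become 20, 60, and 100, so the ranking of pairs within a genome is preserved.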

We assessed the homoeolog predictions by looking at the correlation between the total number of orthologs per homoeolog pair and the confidence score. We also manually evaluated a set of 10 homoeolog pairs from each of the 0–60, 60–90, and 90–100 confidence score ranges (

The total number of orthologs takes into account the ortholog predictions for all of the species in OMA. Homoeolog pairs with few orthologs are either lineage-specific or dubious, whereas pairs with many orthologs represent those more likely to be true. Although the correlation is low between the total number of orthologs and the confidence score (

Interestingly, in GOSHI, for the set of 10 manually evaluated pairs from 0 to 60 confidence, half had RVT-3 (reverse transcriptase-like) domains. According to the CDD description, “This domain is found in plants and appears to be part of a retrotransposon”. This could explain the low number of orthologs for those with low confidence scores, because TEs evolve rapidly and may have lost detectable homology in the other species. By contrast, none of the sampled homoeologs with a confidence score above 60 had an RVT-3 annotation, or any functional description associated with TEs.

Finally, in order to compare our new confidence scores to the previous confidence classes in OMA, we looked at the proportion of pairs in each confidence score bin which were previously marked as either high or low confidence (

Fuzzy logic has some applications in biology (

Although our methods of defining membership functions and rules for the confidence scores may seem ad hoc, that is where fuzzy logic excels. We do not claim this method to be objective or the best schema; however, after manual evaluation, the results are interpretable and useful. An important limitation of our approach is that the confidence scores are heavily based on synteny, which is known to degrade over evolutionary time. Therefore, synteny may be low or even undetectable for older polyploids. The species used in this study are relatively young polyploids, and this approach has not been tested in paleopolyploids. Nevertheless, the results are relevant for the three polyploid species in OMA, and are reproducible as well, as they are coded into the OMA pipeline.

Synteny has been used already as a way to assess the confidence of ortholog pairs. For example, in Ensembl, a “gene order conservation” score uses a window of two genes on each side of a given ortholog prediction and checks whether the genes are also orthologs and in the same orientation. Furthermore, they calculate a “whole genome alignment” score which assesses the proportion of a given ortholog pair which fall within syntenic regions, with more weight given to exons that can be aligned than introns (

The new homoeolog confidence scores are an improvement over the way we previously assigned confidence classes. Going from a discrete category of high vs. low to a quantitative score gives users a wider range of options depending on the analyses they want to perform. For example, for finding differential gene expression among homoeologs, one may want to be conservative and take only those pairs with 90+ confidence. On the other hand, if scanning for all potential homoeologs that could provide disease resistance, some potential R-genes may look like highly repetitive TEs (

Between the polyploid species used in this study, there are genome specificities, biological as well as assembly-wise, which is why we see differences in terms of confidence scores. For example, wheat has the majority of its assembly still in scaffolds, which is why the peak of scores is around 70–80. These confidence scores will most likely increase when the assembly improves. However, by using the local synteny, our method at least allows us to make a confidence score using scaffolds.

Homoeologs are not always syntenic, present in one-to-one copies, or separated by a low distance. Pairs on non-matching chromosomes may be homoeologs that represent single-gene translocations. Furthermore, some genes can evolve quickly, giving an abnormally high distance. Additionally, some genes might have a high copy number. These could be real genes that have a propensity for duplication (depending on their function, their location in a recombination hotspot, the gene balance hypothesis, etc.). It is important not to disregard these pairs in homoeolog inference methodology, as they could still represent interesting functions.

To assign confidence scores to inferred pairs of homoeologs, we introduced a fuzzy logic-based method combining evolutionary distance, local synteny, and cardinality of homoeology relationships. Even though there is a degree of subjectivity in defining the fuzzy rules, the resulting scores proved meaningful in how they correlate with the number of orthologs and in a manual inspection of a random subset of 30 instances. The framework constitutes a substantial improvement over the previous confidence score which was only based on global synteny and had much less granularity.

The table is divided into three parts: the first 10 rows are homoeolog pairs randomly chosen from those inferred to have a Confidence Score below 60. The second part is 10 random pairs with a Confidence Score between 60–90. The last rows have a Confidence Score between 90–100. The first two columns contain the identifiers of the two genes in a homoeologous pair (both OMA IDs and source IDs). The remaining columns are: the chromosomes for each gene in the pair, the Synteny Score of the pair (see Materials and Methods), the Evolutionary Distance of the pair, the total number of orthologs for the pair (inferred by OMA), the average protein length of the pair, and domain hits of the sequence (from NCBI’s Conserved Domain Database).

Christophe Dessimoz is an Academic Editor for PeerJ.

The following information was supplied regarding data availability:

Data can be found in the OMA database (

Under “Other files”, select “OMA Browser database (as hdf5)”.