Collecting reliable clades using the Greedy Strict Consensus Merger
 Academic Editor
 Abhishek Kumar
 Subject Areas
 Bioinformatics, Computational Biology, Evolutionary Studies, Genetics, Taxonomy
 Keywords
 Consensus, Supertree, Supermatrix, Divide and Conquer, FlipCut, Phylogeny
 Copyright
 © 2016 Fleischauer and Böcker
 Licence
 This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
 Cite this article
 Fleischauer and Böcker (2016) Collecting reliable clades using the Greedy Strict Consensus Merger. PeerJ 4:e2172 https://doi.org/10.7717/peerj.2172
Abstract
Supertree methods combine a set of phylogenetic trees into a single supertree. Similar to supermatrix methods, these methods provide a way to reconstruct larger parts of the Tree of Life, potentially evading the computational complexity of phylogenetic inference methods such as maximum likelihood. The supertree problem can be formalized in different ways to cope with contradictory information in the input. Many supertree methods have been developed. Some of them solve NP-hard optimization problems like the well-known Matrix Representation with Parsimony, while others have polynomial worst-case running time but work in a greedy fashion (FlipCut). Both can profit from a set of clades that are already known to be part of the supertree. The SuperFine approach shows how the Greedy Strict Consensus Merger (GSCM) can be used as preprocessing to find these clades. We introduce different scoring functions for the GSCM, a randomization, as well as a combination thereof to improve the GSCM to find more clades. This helps, in turn, to improve the resolution of the GSCM supertree. We find these modifications to increase the number of true positive clades by 18% compared to the currently used Overlap scoring.
Introduction
Supertree methods are used to combine a set of phylogenetic trees with non-identical but overlapping taxon sets into a larger supertree that contains all the taxa of every input tree. Many supertree methods have been established over the years, see for example: Bininda-Emonds (2004); Ross & Rodrigo (2004); Chen et al. (2006); Holland et al. (2007); Scornavacca et al. (2008); Ranwez, Criscuolo & Douzery (2010); Bansal et al. (2010); Snir & Rao (2010); Swenson et al. (2012); Brinkmeyer, Griebel & Böcker (2013); Berry, Bininda-Emonds & Semple (2013); Gysel, Gusfield & Stevens (2013); Whidden, Zeh & Beiko (2014). These methods complement supermatrix methods, which combine the “raw” sequence data rather than the trees (Von Haeseler, 2012).
In contrast to supermatrix methods, supertree methods allow us to analyze large datasets without constructing a multiple sequence alignment for the complete dataset, and without a phylogenetic analysis of the resulting alignment. In this context, supertree methods can be used as part of divide-and-conquer meta techniques (Huson, Nettles & Warnow, 1999; Huson, Vawter & Warnow, 1999; Roshan et al., 2004; Nelesen et al., 2012), which break down a large phylogenetic problem into smaller subproblems that are computationally much easier to solve. The results of the subproblems are then combined using a supertree method.
Constructing a supertree is easy if no contradictory information is encoded in the input trees (Aho et al., 1981). However, resolving conflicts in a reasonable and swift way remains difficult. Matrix Representation with Parsimony (MRP) (Baum, 1992; Ragan, 1992) is still the most widely used supertree method today, as the constructed supertrees are of comparatively high quality. Since MRP is NP-hard (Foulds & Graham, 1982), heuristic search strategies have to be used. Swenson et al. (2012) introduced SuperFine, which combines the Greedy Strict Consensus Merger (GSCM) (Huson, Vawter & Warnow, 1999; Roshan et al., 2003) with MRP. The basic idea is to use a very conservative supertree method (in this case GSCM) as preprocessing for better-resolving supertree methods (in this case MRP). Conservative supertree methods only resolve conflict-free clades and keep the remaining parts of the tree unresolved. We call those resolved parts of a conservative supertree reliable clades. Other better-resolving supertree methods, such as the polynomial-time FlipCut (Brinkmeyer, Griebel & Böcker, 2013) algorithm, may also benefit from this preprocessing.
The number of reliable clades returned by GSCM is highly dependent on the merging order of the source trees. Although the GSCM only returns clades that are compatible with all source trees, we find that it likewise produces clades which are not supported by any of the source trees (bogus clades). Obviously, bogus clades do not necessarily have to be part of the supertree.
With the objective of improving the GSCM as a preprocessing method, we introduce new scoring functions, describe a new randomized GSCM algorithm, and show how to combine multiple GSCM results. Our new scorings increase the number of true positive clades by 5% while simultaneously reducing the number of false positive clades by 2%. Combining different scoring functions and randomization further increases the number of true positive clades by up to 18%. We find that combining a sufficient number of randomized GSCM trees is more robust than a single GSCM tree.
We describe and implement a variant of the GSCM algorithm for rooted input trees and adapt the scoring functions used within SuperFine (Swenson et al., 2012). We find that our new scoring functions and modifications improve on the ones adapted from Swenson et al. (2012) in the rooted case. Although all scoring functions and modifications can be generalized to the unrooted case, the results may differ for unrooted trees.
All presented methods are part of our GSCM command line tool (https://bio.informatik.unijena.de/software/gscm/).
Methods
Preliminaries
In this paper, we deal with graph theoretical objects called rooted (phylogenetic) trees. Let $\mathcal{V}\left(T\right)$ be the vertex set of a tree T. Every leaf of a tree T is uniquely labeled and called a taxon. Let $\mathcal{L}\left(T\right)\subset \mathcal{V}\left(T\right)$ be the set of all taxa in T. We call every vertex $v\in \mathcal{V}\left(T\right)\setminus \mathcal{L}\left(T\right)$ an inner vertex. An inner vertex $c\in \mathcal{V}\left(T\right)$ comprises a clade $C=\mathcal{L}\left({T}^{c}\right)\subseteq \mathcal{L}\left(T\right)$ where T^{c} is the subtree of T rooted at c. Two clades C_{1} and C_{2} are compatible if C_{1}∩C_{2} ∈ {C_{1}, C_{2}, ∅}. Two trees are compatible if all clades are pairwise compatible. The resolution of a rooted tree is defined as $\frac{\left|\mathcal{V}\left(T\right)\right|-\left|\mathcal{L}\left(T\right)\right|}{\left|\mathcal{L}\left(T\right)\right|-1}$. Hence, a completely unresolved (i.e., star) tree has resolution 0, whereas a fully resolved (i.e., binary) tree has resolution 1. For a given collection of trees $\mathcal{T}=\left\{{T}_{1},\dots ,{T}_{k}\right\}$, a supertree T of $\mathcal{T}$ is a phylogenetic tree with leaf set $\mathcal{L}\left(T\right)={\bigcup}_{{T}_{i}\in \mathcal{T}}\mathcal{L}\left({T}_{i}\right)$. A supertree T is called a consensus tree if for all input trees ${T}_{i},{T}_{j}\in \mathcal{T}$, $\mathcal{L}\left({T}_{i}\right)=\mathcal{L}\left({T}_{j}\right)$ holds. A strict consensus of $\mathcal{T}$ is a tree that only contains clades present in all trees ${T}_{i}\in \mathcal{T}$. A semi-strict consensus of $\mathcal{T}$ contains all clades that appear in some input tree and are compatible with each clade of each ${T}_{i}\in \mathcal{T}$ (Bryant, 2003).
For a set of taxa $X\subset \mathcal{L}\left(T\right)$, we define the X-induced subtree of T, T_{|X}, as the tree obtained by taking the (unique) minimal subgraph T(X) of T that connects the elements of X and then suppressing all vertices with outdegree one: that is, for every inner vertex v with outdegree one, replace the adjacent edges (p, v) and (v, c) by a single edge (p, c) and delete v.
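The definitions of clades and compatibility can be illustrated with a minimal Python sketch. The encoding of a rooted tree as nested tuples with string leaves, and the function names `clades`, `compatible` and `trees_compatible`, are assumptions made for this illustration only; they are not taken from the GSCM tool.

```python
def clades(tree):
    """Return all clades of a rooted tree as a set of frozensets of taxa.

    Hypothetical encoding: an inner vertex is a tuple of children,
    a leaf (taxon) is a string.
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):          # leaf: a single taxon
            return frozenset([node])
        taxa = frozenset().union(*(leaves(child) for child in node))
        found.add(taxa)                    # every inner vertex comprises a clade
        return taxa

    leaves(tree)
    return found


def compatible(c1, c2):
    """C1 and C2 are compatible if their intersection is C1, C2 or empty."""
    common = c1 & c2
    return common == c1 or common == c2 or not common


def trees_compatible(t1, t2):
    """Two trees are compatible if all their clades are pairwise compatible."""
    return all(compatible(a, b) for a in clades(t1) for b in clades(t2))


t1 = (("a", "b"), "c")    # clades {a,b} and {a,b,c}
t2 = (("b", "c"), "a")    # clades {b,c} and {a,b,c}
star = ("a", "b", "c")    # only the root clade {a,b,c}
```

Here `t1` and `t2` conflict because {a,b} and {b,c} overlap without one containing the other, whereas the star tree is compatible with both.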
Strict consensus merger (SCM)
For a given pair of trees T_{1} and T_{2} with overlapping taxon sets, the SCM (Huson, Vawter & Warnow, 1999; Roshan et al., 2003) calculates a supertree as follows. Let $X=\mathcal{L}\left({T}_{1}\right)\cap \mathcal{L}\left({T}_{2}\right)$ be the set of common taxa and T_{1|X} and T_{2|X} the X-induced subtrees. Calculate T_{X} = strictConsensus(T_{1|X}, T_{2|X}). Insert all subtrees that were removed from T_{1} and T_{2} to create T_{1|X} and T_{2|X} into T_{X} without violating any of the clades in T_{1} or T_{2}. If removed subtrees of T_{1} and T_{2} attach to the same edge e in T_{X}, a collision occurs. In that case, all subtrees attaching to e will be inserted at the same point by subdividing e and creating a polytomy at the new vertex (see Fig. 1).
Note that neither the strict consensus nor the collision handling inserts clades into the supertree T_{X} that conflict with any of the source trees.
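The first step of the SCM, the strict consensus of the two X-induced subtrees, can be sketched on clade sets. This is an illustrative sketch only: trees are represented by their sets of clades (frozensets of taxa), the function names are hypothetical, and the reinsertion of removed subtrees together with the collision handling is deliberately left out.

```python
def induced_clades(clade_set, taxa):
    """Clades of the subtree induced by `taxa`.

    Each clade is restricted to `taxa`; restrictions with fewer than two
    taxa correspond to leaves or suppressed vertices and are dropped.
    """
    restricted = {clade & taxa for clade in clade_set}
    return {clade for clade in restricted if len(clade) >= 2}


def strict_consensus_on_overlap(clades1, taxa1, clades2, taxa2):
    """Clades shared by both trees after restriction to the common taxa X."""
    common = taxa1 & taxa2
    return induced_clades(clades1, common) & induced_clades(clades2, common)


# T1 = (((a,b),c),x) and T2 = ((a,(b,c)),y), given by their clades.
clades1 = {frozenset("ab"), frozenset("abc"), frozenset("abcx")}
clades2 = {frozenset("bc"), frozenset("abc"), frozenset("abcy")}
consensus = strict_consensus_on_overlap(
    clades1, frozenset("abcx"), clades2, frozenset("abcy"))
```

In this example the common taxa are {a, b, c}; the conflicting clades {a,b} and {b,c} are dropped and only the root clade {a,b,c} survives.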
Greedy Strict Consensus Merger (GSCM)
The GSCM algorithm generalizes the SCM idea to combine a collection $\mathcal{T}=\left\{{T}_{1},{T}_{2},\dots ,{T}_{k}\right\}$ of input trees into a supertree T with $\mathcal{L}\left(T\right)={\bigcup}_{i=1}^{k}\mathcal{L}\left({T}_{i}\right)$ by pairwise merging trees until only the supertree is left. Let score(T_{i}, T_{j}) be a function returning an arbitrary score of two trees T_{i} and T_{j}. At each step, the pair of trees that maximizes score(T_{i}, T_{j}) is selected and merged, resulting in a greedy algorithm. Since the SCM does not insert clades that contradict any of the source trees, the GSCM returns a supertree that only contains clades that are compatible with all source trees.
function scm (tree T_{1}, tree T_{2})
    $X\leftarrow \mathcal{L}\left({T}_{1}\right)\cap \mathcal{L}\left({T}_{2}\right)$
    if |X| ≥ 3 then ▹ Otherwise, the merged tree will be unresolved.
        calculate T_{1|X} and T_{2|X}
        T_{X} ← strictConsensus(T_{1|X}, T_{2|X})
        for all removed subtrees of T_{1} and T_{2} do
            if collision then ▹ Subtrees of T_{1} and T_{2} attach to the same edge e in T_{X} (Fig. 1)
                insert all colliding subtrees at the same point on e by generating a polytomy.
            else
                reinsert subtree into T_{X} without violating any of the bipartitions in T_{1} or T_{2}.
            end if
        end for
        return T_{X}
    end if
end function

function pickOptimalTreePair (trees $\mathcal{S}\subseteq \left\{{T}_{1},{T}_{2},\dots ,{T}_{k}\right\}$)
    pick two trees $\left\{{T}_{i},{T}_{j}\right\}\subseteq \mathcal{S}$ which maximize score(T_{i}, T_{j})
    return T_{i}, T_{j}
end function

function gscm (trees {T_{1}, T_{2}, …, T_{k}})
    $\mathcal{S}\leftarrow \left\{{T}_{1},{T}_{2},\dots ,{T}_{k}\right\}$
    while $\left|\mathcal{S}\right|\ge 2$ do
        T_{i}, T_{j} ← pickOptimalTreePair$\left(\mathcal{S}\right)$
        $\mathcal{S}\leftarrow \mathcal{S}\setminus \left\{{T}_{i},{T}_{j}\right\}$
        T_{scm} ← scm(T_{i}, T_{j})
        $\mathcal{S}\leftarrow \mathcal{S}\cup \left\{{T}_{scm}\right\}$
    end while
    return T_{scm}
end function
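The greedy control flow of the GSCM can be sketched in Python independently of the actual merger. In the sketch below, `score` and `merge` are passed in as parameters; the example uses taxon sets as stand-ins for trees, the overlap size as score, and set union as a placeholder for the SCM. These stand-ins are assumptions for illustration and do not implement the SCM itself.

```python
from itertools import combinations


def gscm(trees, score, merge):
    """Greedily merge the best-scoring pair until one supertree is left."""
    pool = list(trees)
    if not pool:
        raise ValueError("need at least one input tree")
    while len(pool) >= 2:
        # Pick the pair of pool positions (i < j) with maximal score.
        i, j = max(combinations(range(len(pool)), 2),
                   key=lambda ij: score(pool[ij[0]], pool[ij[1]]))
        merged = merge(pool[i], pool[j])
        pool.pop(j)            # j > i, so remove the higher index first
        pool.pop(i)
        pool.append(merged)
    return pool[0]


# Stand-in example: "trees" are taxon sets, merging is set union.
order = []                     # records the merging order


def overlap(a, b):
    return len(a & b)


def union_merge(a, b):
    order.append((a, b))
    return a | b


A, B, C = frozenset("abcd"), frozenset("bcde"), frozenset("ef")
supertree_taxa = gscm([A, B, C], overlap, union_merge)
```

With these stand-ins, A and B are merged first because they share three taxa, and C is merged last. Swapping in a different `score` changes only the merging order, which is exactly the degree of freedom the scoring functions exploit.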
Tree merging order
Although the SCM for two trees is deterministic, the output of the GSCM is influenced by the order of selecting pairs of trees to be merged, since the resulting number and positions of collisions may vary.
Let T_{1}, …, T_{n} be a set of input trees we want to merge into a supertree using the GSCM. When merging two trees, the strict consensus merger (SCM) accepts only clades that can be safely inferred from the two source trees. In case of a collision during reinsertion of unique taxa, the colliding subtrees are inserted as a polytomy on the edge where the collision occurred.
If collisions of different merging steps occur on the same edge, the polytomy created by the first collision may cause the following collisions to not occur. Such obviated collisions induce bogus clades (see Fig. 2) which cannot be inferred unambiguously from the source trees and hence should not be part of the supertree. A clade C of a supertree T = GSCM(T_{1}, …, T_{n}) is a bogus clade if there is another supertree T′ = GSCM(T_{1}, …, T_{n}) (based on a different tree merging order) that contains a clade C′ conflicting with C (see Figs. 2A and 2C). Note that bogus clades cannot be recognized by comparison to the source trees since they do not conflict with any of the source trees T_{1}, …, T_{n}. All clades in the GSCM supertree that are not bogus, are called reliable clades.
Because of these bogus clades, the GSCM supertree with the highest resolution may not be the best supertree. To use the GSCM as preprocessing for other supertree methods, it is important to prevent bogus clades. Clades resulting from the preprocessing are fixed and will definitely be part of the final supertree (even if they are wrong). To use the GSCM as an efficient preprocessing step, we want to find as many of the existing reliable clades as possible. Therefore, we searched for scoring functions that maximize the number of reliable clades while simultaneously minimizing the number of bogus clades.
Scoring functions
We present three novel scoring functions that produce high quality GSCM supertrees with respect to $F_1$-score and number of unique clades (unique in terms of not occurring in a supertree resulting from any of the other scorings). In addition, we use the original Resolution scoring (Roshan et al., 2003), as well as the UniqueTaxa and Overlap scorings (Swenson et al., 2012).
Let $uc\left(T,{T}^{\prime}\right)=\mathcal{V}\left({T}_{|\mathcal{L}\left(T\right)\setminus \mathcal{L}\left({T}^{\prime}\right)}\right)\setminus \mathcal{L}\left(T\right)$ be the set of unique clades of T compared to T′, that is, the inner vertices of the subtree of T induced by the taxa that occur in T but not in T′.
UniqueCladesLost scoring: minimizing the number of unique clades that get lost: $score\left({T}_{i},{T}_{j}\right)=-\left(\left(\left|uc\left({T}_{i},{T}_{j}\right)\right|-\left|uc\left(scm\left({T}_{i},{T}_{j}\right),{T}_{j}\right)\right|\right)+\left(\left|uc\left({T}_{j},{T}_{i}\right)\right|-\left|uc\left(scm\left({T}_{i},{T}_{j}\right),{T}_{i}\right)\right|\right)\right).$
UniqueCladeRate scoring: maximizing the number of preserved unique clades: $score\left({T}_{i},{T}_{j}\right)=\frac{\left|uc\left({T}_{i},{T}_{j}\right)\right|+\left|uc\left({T}_{j},{T}_{i}\right)\right|}{\left|uc\left(scm\left({T}_{i},{T}_{j}\right),{T}_{i}\right)\right|+\left|uc\left(scm\left({T}_{i},{T}_{j}\right),{T}_{j}\right)\right|}.$
Collision scoring: minimizing the number of collisions: $score\left({T}_{i},{T}_{j}\right)=-\left(\text{number of edges in}\phantom{\rule{1em}{0ex}}\text{SCM}\left({T}_{i},{T}_{j}\right)\phantom{\rule{1em}{0ex}}\text{where a collision occurred}\right).$
UniqueTaxa scoring (Swenson et al., 2012): minimizing the number of unique taxa: $score\left({T}_{i},{T}_{j}\right)=-\left|\mathcal{L}\left({T}_{i}\right)\mathrm{\Delta}\mathcal{L}\left({T}_{j}\right)\right|.$
Overlap scoring (Swenson et al., 2012): maximizing the number of common taxa: $score\left({T}_{i},{T}_{j}\right)=\left|\mathcal{L}\left({T}_{i}\right)\cap \mathcal{L}\left({T}_{j}\right)\right|.$
Resolution scoring (Roshan et al., 2003): maximizing the resolution of the SCM tree: $score\left({T}_{i},{T}_{j}\right)=\frac{\left|\mathcal{V}\left(\text{SCM}\left({T}_{i},{T}_{j}\right)\right)\right|-\left|\mathcal{L}\left(\text{SCM}\left({T}_{i},{T}_{j}\right)\right)\right|}{\left|\mathcal{L}\left(\text{SCM}\left({T}_{i},{T}_{j}\right)\right)\right|-1}.$
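The two scorings that depend only on the taxon sets are simple enough to sketch directly. The following Python sketch assumes that each tree is represented just by its set of taxa and that the pair with the maximal score is merged; under that assumption, the UniqueTaxa score is negated so that fewer unique taxa give a higher score. The function names are hypothetical.

```python
def overlap_score(taxa_i, taxa_j):
    """Overlap scoring: number of taxa common to both trees."""
    return len(taxa_i & taxa_j)


def unique_taxa_score(taxa_i, taxa_j):
    """UniqueTaxa scoring: size of the symmetric difference of the taxon sets.

    Negated (an assumption of this sketch) so that maximizing the score
    minimizes the number of unique taxa.
    """
    return -len(taxa_i ^ taxa_j)


taxa_1 = frozenset("abcd")
taxa_2 = frozenset("cde")
taxa_3 = frozenset("abcdefgh")
```

For example, `taxa_1` and `taxa_3` have the largest overlap (four taxa) but also four unique taxa, while `taxa_1` and `taxa_2` share only two taxa but have three unique ones. The two scorings can therefore prefer different pairs.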
Combining multiple scorings
In general, supertrees created with the GSCM using different scoring functions contain different clades. To collect as many reliable clades as possible, we compute several GSCM supertrees using different scoring functions and combine them afterwards.
Reliable clades of all possible GSCM supertrees for a given set of source trees are pairwise compatible. In contrast, bogus clades can be incompatible among different GSCM supertrees (see Fig. 2). Thus, every conflicting clade has to be a bogus clade. By removing incompatible clades we only eliminate bogus clades but none of the reliable clades from our final supertree.
Eliminating bogus clades while assembling reliable clades is done using a semi-strict consensus algorithm (Bryant, 2003). It should be noted that bogus clades are only eliminated if they induce a conflict between at least two supertrees (see Fig. 2). Hence, there is no guarantee to eliminate all bogus clades.
Combined scoring: Let Combined3 be the combination of the Collision, UniqueCladeRate and UniqueCladesLost scoring functions. Furthermore Combined5 combines the Collision, UniqueCladeRate, UniqueCladesLost, Overlap and UniqueTaxa scoring functions.
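The effect of the semi-strict consensus on conflicting clades can be sketched on clade sets. The sketch below is a direct, quadratic-time reading of the definition (keep every clade that is compatible with every clade of every input tree) and is not the algorithm of Bryant (2003); representing each supertree as a set of frozensets is an assumption for illustration.

```python
def compatible(c1, c2):
    """Two clades are compatible if one contains the other or they are disjoint."""
    common = c1 & c2
    return common == c1 or common == c2 or not common


def semi_strict_consensus(clade_sets):
    """Clades occurring in some tree and compatible with all clades of all trees."""
    all_clades = set().union(*clade_sets)
    return {clade for clade in all_clades
            if all(compatible(clade, other) for other in all_clades)}


# Two hypothetical GSCM supertrees on the taxa a..e, given by their clades.
supertree_1 = {frozenset("ab"), frozenset("abc"), frozenset("abcde")}
supertree_2 = {frozenset("bc"), frozenset("abc"), frozenset("de"),
               frozenset("abcde")}
combined = semi_strict_consensus([supertree_1, supertree_2])
```

The clades {a,b} and {b,c} conflict and are both removed, whereas {d,e}, which occurs in only one supertree but conflicts with nothing, is collected into the combined result.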
Randomized GSCM
Generating many different GSCM supertrees increases the probability of both detecting all reliable clades and eliminating all bogus clades. To generate a larger number of GSCM supertrees, randomizing the tree merging order of the GSCM algorithm may be more suitable than using a variety of different tree selection scorings. To this end, we replace picking an optimal pair of trees (see Algorithm 2) by picking a random pair of trees (see Algorithm 3).
function pickRandomTreePair (trees $\mathcal{S}\subseteq \left\{{T}_{1},{T}_{2},\dots ,{T}_{k}\right\}$)
    randomly pick a pair of trees $\left\{{T}_{i},{T}_{j}\right\}\subseteq \mathcal{S}$ with probability
        $P\left({T}_{i},{T}_{j}\right)=\frac{score\left({T}_{i},{T}_{j}\right)}{{\sum}_{{T}_{a},{T}_{b}\in \mathcal{S},a\ne b}score\left({T}_{a},{T}_{b}\right)},i\ne j$
    return T_{i}, T_{j}
end function
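Score-proportional sampling of a tree pair can be sketched with Python's standard `random.choices`. The sketch assumes that all scores are non-negative and that at least one is positive (as for the Overlap scoring); scorings that yield negative values would have to be shifted or rescaled first, which is not shown. Taxon sets again stand in for trees, and the function name is hypothetical.

```python
import random
from itertools import combinations


def pick_random_tree_pair(trees, score, rng=random):
    """Pick one unordered pair of trees with probability proportional to its score.

    Assumes non-negative scores with a positive total.
    """
    pairs = list(combinations(trees, 2))
    weights = [score(a, b) for a, b in pairs]
    return rng.choices(pairs, weights=weights, k=1)[0]


A, B, C = frozenset("abcd"), frozenset("bcde"), frozenset("ef")
rng = random.Random(0)         # seeded only to make the example repeatable
picks = [pick_random_tree_pair([A, B, C], lambda a, b: len(a & b), rng)
         for _ in range(200)]
```

In this example the pair (A, C) has overlap zero and is therefore never drawn, while (A, B) is drawn three times as often as (B, C) in expectation.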
Running the randomized GSCM for different scoring functions multiple (k) times allows us to generate a large number of supertrees containing different clades. The resulting trees are combined using a semi-strict consensus as described in the previous section. For combined scorings (Combinedn) with n different scoring functions, we calculate $\frac{k}{n}$ supertrees for each of the scoring functions and combine all k supertrees using the semi-strict consensus.
Experimental Setup
Dataset
To evaluate the different modifications of the GSCM algorithm, we simulate a rooted dataset called SMIDGenOG, which is based on the SMIDGen protocol (Swenson et al., 2010).
We generate 30 model trees with 1,000 (500/100) taxa. For each model tree, we generate a set of 30 (15/5) clade-based source trees and four scaffold source trees containing 20%, 50%, 75%, or 100% of the taxa in the model tree (the scaffold density). We set up four different source tree sets, each containing all clade-based trees and one of the four scaffold trees.
The SMIDGen protocol follows data collection processes used by systematists when gathering empirical data, e.g., the creation of several densely-sampled clade-based source trees, and a sparsely-sampled scaffold source tree. All source trees are rooted using an outgroup. Unless indicated otherwise, we strictly follow the protocol of Swenson et al. (2010), see there for more details:

1. Generate model trees. We generate model trees using r8s (Sanderson, 2003) as described by Swenson et al. (2010). To each model tree, we add an outgroup. The branch to the outgroup gets the length of the longest path in the tree, plus a random value between 0 and 1. This outgroup placement guarantees that there exists an outgroup for every possible subtree of the model tree.

2. Generate sequences. Universal genes appear at the root of the model tree and do not go extinct. We simulate five universal genes along the model tree. Universal genes are used to infer scaffold trees. To simulate non-universal genes, we use a gene “birth–death” process (as described by Swenson et al. (2010)) to determine 200 subtrees (one for each gene) within the model tree for which a gene will be simulated. For comparison, the SMIDGen dataset evolves 100 non-universal genes. Simulating a higher number of genes increases the probability of finding a valuable outgroup. Genes (both universal and non-universal) are simulated under a GTR + Gamma + Invariable Sites process along the respective tree, using Seq-Gen (Rambaut & Grassly, 1997).

3. Generate source alignments. To generate a clade-based source alignment, we select a clade of interest from the model tree using a “birth” node selection process (as described by Swenson et al. (2010)). For each clade of interest, we select the three non-universal gene sequences with the highest taxa coverage to build the alignment. For each source alignment, we search in the model tree for an outgroup where all three non-universal genes are present and add it to the alignment.
To generate a scaffold source alignment, we randomly select a subset of taxa from the model tree with a fixed probability (scaffold factor) and use the universal gene sequences.

4. Estimation of source trees. We estimate Maximum Likelihood (ML) source trees using RAxML with GTRGAMMA default settings and 100 bootstrap replicates. We root all source trees using the outgroup, and remove the outgroups afterwards.
Evaluation
To evaluate the accuracy of tree reconstruction methods on simulated data, a widespread method is calculating the rates of false negative ($\mathit{FN}$) clades and false positive ($\mathit{FP}$) clades between an estimated tree (supertree) and the corresponding model tree. $\mathit{FN}$ clades are in the model tree but not in the supertree. $\mathit{FP}$ clades are in the supertree but not in the model tree.
$\mathit{FN}$-rates and $\mathit{FP}$-rates contain information on the resolution of the supertree. Model trees are fully resolved. If the supertree is fully resolved too, we get $\mathit{FN}$-rate = $\mathit{FP}$-rate. Otherwise, if $\mathit{FN}$-rate > $\mathit{FP}$-rate, the supertree is not fully resolved. Clades in the supertree that are not $\mathit{FP}$s are true positive ($\mathit{TP}$) clades.
As mentioned above, we try to improve the GSCM as a preprocessing method and thus want to maximize the number of $\mathit{TP}$, while keeping the number of $\mathit{FP}$ minimal. This is reflected in the $F_1$-score: ${F}_{1}=\frac{2TP}{2TP+FP+FN}.$ We measure the statistical significance of differences between the averaged $F_1$-scores by the Wilcoxon signed-rank test with α = 0.05. We calculate the pairwise p-values for all 16 scoring functions (including combined scorings and randomized scorings with 400 iterations). This leads to $\frac{16^{2}-16}{2}=120$ significance tests. Respecting the multiple testing problem, we can accept p-values below $\frac{0.05}{120}\approx 0.0004$ (Bonferroni correction). The complete tables can be found in Tables S1, S3 and S5.
Furthermore, Tables S2, S4 and S6 contain the number of times that each scoring function outperforms each other scoring function. Ties are reported as well.
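The evaluation measures can be written down as a short Python sketch. It assumes that the model tree and the supertree are each given as a set of clades (frozensets of taxa); the function name and the example clade sets are hypothetical and not taken from the study data. The last lines redo the Bonferroni arithmetic from the text.

```python
def evaluate(model_clades, supertree_clades):
    """Return (TP, FP, FN, F1) for a supertree compared to its model tree."""
    tp = len(model_clades & supertree_clades)   # clades in both trees
    fp = len(supertree_clades - model_clades)   # only in the supertree
    fn = len(model_clades - supertree_clades)   # only in the model tree
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return tp, fp, fn, f1


model = {frozenset("ab"), frozenset("abc"), frozenset("abcd")}
estimate = {frozenset("ab"), frozenset("cd"), frozenset("abcd")}
tp, fp, fn, f1 = evaluate(model, estimate)

# Bonferroni correction for all pairwise tests among 16 scoring functions.
num_scorings = 16
num_tests = (num_scorings ** 2 - num_scorings) // 2
threshold = 0.05 / num_tests
```

In the example, two of the three estimated clades are correct, one is a false positive and one model clade is missed, giving an F1-score of 4/6.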
Results and Discussion
We find the influence of scoring functions and randomization to increase with the size of the input data (as expected for greedy algorithms). Thus, in the further evaluation we only consider the larger (1,000 taxa) dataset. However, the overall effects are similar for all datasets. For the results of the 500 and 100 taxa datasets, we refer to Figs. S1–S16.
The scaffold factor highly influences the quality of the supertrees (see Figs. 3 and 4). In general, all scoring functions profit from a large scaffold tree. In particular, for a scaffold factor of 100% nearly all scorings perform equally well and better than for all other scaffold factors. A source tree that already contains all taxa simplifies the supertree computation for the GSCM algorithm. Starting with the scaffold tree and merging the remaining source trees in arbitrary order leads to the optimal solution. No collision can occur, when the taxon set of one tree is a subset of the taxon set of the other tree. However, the Resolution and UniqueTaxa scoring functions do not necessarily pick the scaffold tree in the first step and therefore do not necessarily lead to an optimal solution. In contrast, the Overlap scoring—which does not perform well for small scaffold tree sizes (20%, 50%)—produces optimal solutions for a scaffold factor of 100%.
Comparing the different scoring functions, we find that in general, the $\mathit{FN}$-rate varies more than the $\mathit{FP}$-rate (see Fig. 3). Our presented scoring functions (Collision, UniqueCladesLost, UniqueCladeRate) decrease the $\mathit{FN}$-rate without increasing the $\mathit{FP}$-rate (see Fig. 3). This leads to the highest $F_1$-scores for all scaffold factors (see Fig. 4A). They clearly outperform the Resolution, Overlap and UniqueTaxa scorings for scaffold factors 50% and 75%. The differences in the $F_1$-scores are significant (p-values below 0.000033). For a scaffold factor of 20%, the improvements of our scoring functions in comparison to UniqueTaxa are not significant. For a scaffold factor of 100%, the Overlap scoring function is on par with our scoring functions (all of them will return the optimal solution). The differences between Collision, UniqueCladesLost and UniqueCladeRate are not significant. Nevertheless, UniqueCladesLost provides the most robust and input-independent results. For scaffold factors of 20% and 50%, Resolution and Overlap show significantly worse (p-values ≤ 0.000006) $F_1$-scores than all other scoring functions (see Fig. 4A). There is no significant difference (p-values > 0.09) between Resolution and Overlap scoring. For scaffold factors of 75% and 100%, the Resolution scoring function performs significantly worse than all others. For a scaffold factor of 75%, there is no significant difference between UniqueTaxa and Overlap scoring. For a scaffold factor of 100%, the Overlap scoring function performs better than UniqueTaxa, which is still significantly better than Resolution.
Even for equally performing scoring functions, the resulting trees are often different (except for scaffold factor 100%). Thus, we combine the GSCM supertrees computed with different scorings using the semi-strict consensus. Since the Resolution scoring function performs badly, we only combine the remaining five scoring functions. The combination of different scoring functions strongly improves the $\mathit{FN}$-rate. Thus, the combined supertrees have improved $F_1$-scores for all scaffold densities (see Fig. 4B). The combination of Collision, UniqueCladesLost, UniqueCladeRate, Overlap and UniqueTaxa (Combined5) results in the best $F_1$-score. However, Combined5 has a significantly worse $\mathit{FP}$-rate than all other scorings. In contrast, the combination of the Collision, UniqueCladesLost and UniqueCladeRate scorings (Combined3) shows no significant decline of the $\mathit{FP}$-rate.
To collect as many $\mathit{TP}$ clades as possible, we use a randomized tree merging order generating multiple (k) supertrees which are combined using the semi-strict consensus. Generally, we found that randomization further improves the $F_1$-score in comparison to the single scoring functions (see Fig. 4D). Compared to the Combined5 scoring, there is only an improvement of the $F_1$-score for scaffold factors of 50% and 75%. Again, these improvements come with a significant increase of the $\mathit{FP}$-rate.
Already for 25 random iterations, all presented scoring functions perform on almost the same level (see Fig. 4C). As the number of random iterations increases, the difference between the reported scoring functions vanishes.
Conclusion
We found that collisions not only destroy source tree clades but also introduce bogus clades to the supertree. Thus, the scoring functions that minimize the number of collisions perform best. Combining multiple GSCM supertrees using a semi-strict consensus method helps to better resolve the supertree.
We presented three novel scoring functions (Collision, UniqueCladesLost, UniqueCladeRate) that increase the number of true positive clades and decrease the number of false positive clades of the resulting supertree. The UniqueCladesLost scoring is the overall best-performing scoring function.
Combining the supertrees calculated by these three scorings using a semi-strict consensus algorithm further increases the number of true positive clades without a significant increase of the false positives.
For almost all presented scoring functions, the highest $F_1$-scores and best resolved trees are achieved using randomized GSCM. Randomization indeed increases the number of true positive clades but also significantly increases false positive clades. Thinking of GSCM as a preprocessing method, those false positive clades will have a strongly negative influence on the quality of the final supertree.
Depending on the application, “best performance” is characterized differently. The most conservative approach is our UniqueCladesLost scoring function, which increases the $\mathit{TP}$-rate by 5% while decreasing the $\mathit{FP}$-rate by 2% compared to Overlap. To use GSCM as a preprocessing method, we recommend a combination of the Collision, UniqueCladesLost and UniqueCladeRate scorings (Combined3). In comparison to the Overlap scoring function, this increases the number of true positive clades by 9% without a significant increase of false positive clades. The overall best ratio of true positive and false positive clades can be achieved with a combination of randomized Collision, UniqueCladesLost, UniqueCladeRate, Overlap and UniqueTaxa (Combined5) scoring.
All presented methods are part of our platform-independent GSCM command line tool (https://bio.informatik.unijena.de/software/gscm/).
Supplemental Information
1,000 taxa F1score pvalues
Statistical significance (pvalues) of differences between the averaged F1scores by Wilcoxon signedrank test for the 1000 taxa dataset. Respecting the multiple testing problem we can accept pvalues below 0.0004 for a significance level of 0.05
1,000 taxa F1score wins and ties
Number of replicates for each scoring function outperforming another scoring function. Comparison by F1score on the 1,000 taxa dataset. Ties are reported in parentheses.
500 taxa F1score pvalues
Statistical significance (pvalues) of differences between the averaged F1scores by Wilcoxon signedrank test for the 500 taxa dataset. Respecting the multiple testing problem we can accept pvalues below 0.0004 for a significance level of 0.05
500 taxa F1score wins and ties
Number of replicates for each scoring function outperforming another scoring function. Comparison by F1score on the 500 taxa dataset. Ties are reported in parentheses.
100 taxa F1score pvalues
Statistical significance (pvalues) of differences between the averaged F1scores by Wilcoxon signedrank test for the 100 taxa dataset. Respecting the multiple testing problem we can accept pvalues below 0.0004 for a significance level of 0.05
100 taxa F1score wins and ties
Number of replicates for each scoring function outperforming another scoring function. Comparison by F1score on the 100 taxa dataset. Ties are reported in parentheses.
500 taxa FN-rate (A) and FP-rate (B) for single and combined scorings
FN-rates (A) and FP-rates (B) for single scorings (Overlap, UniqueTaxa, Collision, UniqueCladesLost, UniqueCladeRate) and their combinations (Combined3, Combined5) for all scaffold factors (20%, 50%, 75%, 100%) of the 500 taxa dataset. Each Combined scoring is the semi-strict consensus of the supertrees calculated by the respective scorings. The error bars show the standard error.
500 taxa FN-rate for single and combined scorings
FN-rates of single scorings (Overlap, UniqueTaxa, Collision, UniqueCladesLost, UniqueCladeRate) and their combinations (Combined3, Combined5) for all scaffold factors (20%, 50%, 75%, 100%) of the 500 taxa dataset. Each Combined scoring is the semi-strict consensus of the supertrees calculated by the respective scorings. The error bars show the standard error.
500 taxa FP-rate
FP-rates of single scorings (Overlap, UniqueTaxa, Collision, UniqueCladesLost, UniqueCladeRate) and their combinations (Combined3, Combined5) for all scaffold factors (20%, 50%, 75%, 100%) of the 500 taxa dataset. Each Combined scoring is the semi-strict consensus of the supertrees calculated by the respective scorings. The error bars show the standard error.
500 taxa F1-scores for all scoring functions
F1-scores (higher is better) of (A) different scoring functions (Overlap, UniqueTaxa, Resolution, Collision, UniqueCladesLost, UniqueCladeRate) for all scaffold factors (20%, 50%, 75%, 100%) of the 500 taxa dataset. The error bars show the standard error. (B) different single scoring functions (Overlap, UniqueTaxa, Collision, UniqueCladesLost, UniqueCladeRate) and their combinations (Combined3, Combined5) for all scaffold factors (20%, 50%, 75%, 100%) of the 500 taxa dataset. Each Combined scoring is the semi-strict consensus of the supertrees calculated by the respective scorings. The error bars show the standard error. (C) different single scoring functions (Overlap, Collision, UniqueCladesLost) with and without randomization for all scaffold factors (20%, 50%, 75%, 100%) of the 500 taxa dataset. The integer after the keyword “Rand” gives the number of randomized iterations. The error bars show the standard error. (D) different single scoring functions (Overlap, UniqueTaxa, Collision, UniqueCladesLost, UniqueCladeRate) and their combinations (Combined3, Combined5) for all scaffold factors (20%, 50%, 75%, 100%) of the 500 taxa dataset. Each Combined scoring is the semi-strict consensus of the supertrees calculated by the respective scorings. The error bars show the standard error.
500 taxa F1-score for single scorings
F1-scores (higher is better) of different scoring functions (Overlap, UniqueTaxa, Resolution, Collision, UniqueCladesLost, UniqueCladeRate) for all scaffold factors (20%, 50%, 75%, 100%) of the 500 taxa dataset. The error bars show the standard error.
500 taxa F1-score for combined scorings
F1-scores (higher is better) of different single scoring functions (Overlap, UniqueTaxa, Collision, UniqueCladesLost, UniqueCladeRate) and their combinations (Combined3, Combined5) for all scaffold factors (20%, 50%, 75%, 100%) of the 500 taxa dataset. Each Combined scoring is the semi-strict consensus of the supertrees calculated by the respective scorings. The error bars show the standard error.
500 taxa F1-score for randomized GSCM
F1-scores (higher is better) of different single scoring functions (Overlap, Collision, UniqueCladesLost) with and without randomization for all scaffold factors (20%, 50%, 75%, 100%) of the 500 taxa dataset. The integer after the keyword “Rand” gives the number of randomized iterations. The error bars show the standard error.
500 taxa F1-score conclusion
F1-scores (higher is better) of different scoring functions (Overlap, UniqueTaxa, Collision, UniqueCladesLost, Combined5) with and without randomization for all scaffold factors (20%, 50%, 75%, 100%) of the 500 taxa dataset. Each Combined scoring is the semi-strict consensus of the supertrees calculated by the respective scorings. The integer after the keyword “Rand” gives the number of randomized iterations. The error bars show the standard error.
100 taxa FN-rate (A) and FP-rate (B) for single and combined scorings
FN-rates (A) and FP-rates (B) for single scorings (Overlap, UniqueTaxa, Collision, UniqueCladesLost, UniqueCladeRate) and their combinations (Combined3, Combined5) for all scaffold factors (20%, 50%, 75%, 100%) of the 100 taxa dataset. Each Combined scoring is the semi-strict consensus of the supertrees calculated by the respective scorings. The error bars show the standard error.
100 taxa FN-rate for single and combined scorings
FN-rates of single scorings (Overlap, UniqueTaxa, Collision, UniqueCladesLost, UniqueCladeRate) and their combinations (Combined3, Combined5) for all scaffold factors (20%, 50%, 75%, 100%) of the 100 taxa dataset. Each Combined scoring is the semi-strict consensus of the supertrees calculated by the respective scorings. The error bars show the standard error.
100 taxa FP-rate for single and combined scorings
FP-rates of single scorings (Overlap, UniqueTaxa, Collision, UniqueCladesLost, UniqueCladeRate) and their combinations (Combined3, Combined5) for all scaffold factors (20%, 50%, 75%, 100%) of the 100 taxa dataset. Each Combined scoring is the semi-strict consensus of the supertrees calculated by the respective scorings. The error bars show the standard error.
100 taxa F1-scores for all scoring functions
F1-scores (higher is better) of (A) different scoring functions (Overlap, UniqueTaxa, Resolution, Collision, UniqueCladesLost, UniqueCladeRate) for all scaffold factors (20%, 50%, 75%, 100%) of the 100 taxa dataset. The error bars show the standard error. (B) different single scoring functions (Overlap, UniqueTaxa, Collision, UniqueCladesLost, UniqueCladeRate) and their combinations (Combined3, Combined5) for all scaffold factors (20%, 50%, 75%, 100%) of the 100 taxa dataset. Each Combined scoring is the semi-strict consensus of the supertrees calculated by the respective scorings. The error bars show the standard error. (C) different single scoring functions (Overlap, Collision, UniqueCladesLost) with and without randomization for all scaffold factors (20%, 50%, 75%, 100%) of the 100 taxa dataset. The integer after the keyword “Rand” gives the number of randomized iterations. The error bars show the standard error. (D) different single scoring functions (Overlap, UniqueTaxa, Collision, UniqueCladesLost, UniqueCladeRate) and their combinations (Combined3, Combined5) for all scaffold factors (20%, 50%, 75%, 100%) of the 100 taxa dataset. Each Combined scoring is the semi-strict consensus of the supertrees calculated by the respective scorings. The error bars show the standard error.
100 taxa F1-score for single scorings
F1-scores (higher is better) of different scoring functions (Overlap, UniqueTaxa, Resolution, Collision, UniqueCladesLost, UniqueCladeRate) for all scaffold factors (20%, 50%, 75%, 100%) of the 100 taxa dataset. The error bars show the standard error.
100 taxa F1-score for single and combined scorings
F1-scores (higher is better) of different single scoring functions (Overlap, UniqueTaxa, Collision, UniqueCladesLost, UniqueCladeRate) and their combinations (Combined3, Combined5) for all scaffold factors (20%, 50%, 75%, 100%) of the 100 taxa dataset. Each Combined scoring is the semi-strict consensus of the supertrees calculated by the respective scorings. The error bars show the standard error.
100 taxa F1-score for randomized GSCM
F1-scores (higher is better) of different single scoring functions (Overlap, Collision, UniqueCladesLost) with and without randomization for all scaffold factors (20%, 50%, 75%, 100%) of the 100 taxa dataset. The integer after the keyword “Rand” gives the number of randomized iterations. The error bars show the standard error.
100 taxa F1-score conclusion
F1-scores (higher is better) of different scoring functions (Overlap, UniqueTaxa, Collision, UniqueCladesLost, Combined5) with and without randomization for all scaffold factors (20%, 50%, 75%, 100%) of the 100 taxa dataset. Each Combined scoring is the semi-strict consensus of the supertrees calculated by the respective scorings. The integer after the keyword “Rand” gives the number of randomized iterations. The error bars show the standard error.
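The three measures named in the captions above (FN-rate, FP-rate and F1-score) can be computed from two sets of clades. The sketch below is illustrative only and assumes the usual clade-based definitions: FN-rate as the fraction of model-tree clades missing from the supertree, FP-rate as the fraction of supertree clades absent from the model tree, and F1 = 2TP / (2TP + FP + FN). The function name `clade_metrics`, the representation of a clade as a `frozenset` of taxon labels, and the toy trees are assumptions, not taken from the article.

```python
def clade_metrics(model_clades, supertree_clades):
    """Compare supertree clades against model-tree clades.

    Both arguments are sets of frozensets of taxon labels (assumed representation).
    Returns a dict with the FN-rate, FP-rate and F1-score.
    """
    tp = len(model_clades & supertree_clades)   # clades present in both trees
    fn = len(model_clades - supertree_clades)   # model clades the supertree misses
    fp = len(supertree_clades - model_clades)   # supertree clades not in the model

    fn_rate = fn / (tp + fn) if tp + fn else 0.0
    fp_rate = fp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return {"fn_rate": fn_rate, "fp_rate": fp_rate, "f1": f1}


# Toy example: the supertree leaves one model clade unresolved.
model = {frozenset("AB"), frozenset("ABC"), frozenset("DE")}
supertree = {frozenset("AB"), frozenset("DE")}
metrics = clade_metrics(model, supertree)
print(metrics)
```

In this toy case the supertree misses one of three model clades and contains no false clade, so the FN-rate is 1/3, the FP-rate is 0 and the F1-score is 0.8. An unresolved but conflict-free supertree is therefore penalized only through its FN-rate.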