Fast and accurate estimation of the covariance between pairwise maximum likelihood distances

Manuel Gil

doi:10.7717/peerj.583

Fast and accurate estimation of the covariance between pairwise maximum likelihood distances

Manuel Gil

Institute of Molecular Life Sciences, University of Zurich, Zurich, Switzerland

Swiss Institute of Bioinformatics, Lausanne, Switzerland

DOI: 10.7717/peerj.583

Published: 2014-09-25
Accepted: 2014-08-28
Received: 2014-05-16

Academic Editor: Maria Anisimova

Subject Areas: Computational Biology, Mathematical Biology, Molecular Biology, Statistics, Computational Science
Keywords: Maximum likelihood, Pairwise distance, Covariance, Correlation, Alignment

Copyright: © 2014 Gil
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.

Cite this article: Gil M. 2014. Fast and accurate estimation of the covariance between pairwise maximum likelihood distances. PeerJ 2:e583 https://doi.org/10.7717/peerj.583

The author has chosen to make the review history of this article public.

Abstract

Pairwise evolutionary distances are a model-based summary statistic for a set of molecular sequences. They represent the leaf-to-leaf path lengths of the underlying phylogenetic tree. Estimates of pairwise distances with overlapping paths covary because of shared mutation events. It is desirable to take these covariance structure into account to increase precision in any process that compares or combines distances. This paper introduces a fast estimator for the covariance of two pairwise maximum likelihood distances, estimated under general Markov models. The estimator is based on a conjecture (going back to Nei & Jin, 1989) which links the covariance to path lengths. It is proven here under a simple symmetric substitution model. A simulation shows that the estimator outperforms previously published ones in terms of the mean squared error.

Introduction

Phylogenetic trees are one of the most important representations of the evolutionary relationship between homologous genomic sequences. Their relatedness can be summarized by a set of pairwise evolutionary distances representing the leaf-to-leaf path lengths of the underlying tree. Such distances are usually estimated by maximum likelihood (ML) assuming a Markovian model of character substitution (Yang, 2006).

Besides substitutions, a process of insertions and deletions of sequence fragments plays a major role in the evolution of molecular sequences. As a consequence, homologous characters—i.e., the ones related by substitutions only—have to be identified prior to distance estimation. A consistent hypothesis of character homology is provided by multiple sequence alignments (MSAs). However, they are hard to compute, with non-trivial scoring schemes leading to NP complete problems (Wang & Jiang, 1994; Just, 2001; Elias, 2006). Alternatively, the sequences can be aligned pairwise, for instance, by dynamic programming to obtain optimal pairwise alignments (OPAs) in quadratic time in the length of the input sequences (Needleman & Wunsch, 1970).

Pairwise distance methods are generally faster and also simpler than likelihood based approaches that operate directly on sequence data. For that reason, they have often been chosen as an input to large-scale genomic and phylogenetic analyses. Further, distance tree methods are used to produce starting trees for ML tree estimation from MSAs (Guindon et al., 2010; Stamatakis, 2014; Vinh & Von Haeseler, 2004; Gil et al., 2013) and guide trees in progressive MSA methods (e.g., Löytynoja & Goldman, 2008; Katoh et al., 2005).

The speed benefits may affect accuracy due to a potential loss of information involved in the reduction of the sequence data (Steel, Hendy & Penny, 1988). However, this idea has recently been challenged in the context of tree estimation (Roch, 2010). Roch proposed to take advantage of higher order information using the correlations among the pairwise distances, which result from common mutation events on shared paths of the underlying tree. Indeed, most current practical distance tree methods assume statistical independence and do not account for distance covariance (Mihaescu & Pachter, 2008). The BioNJ algorithm (Gascuel, 1997), which uses a first-order model of covariance, is a notable exception. Generally, any process that compares or combines distance profits from a higher precision when the covariance structure is taken into account.

Estimators for the covariance between pairwise ML distance estimates have been proposed for certain mechanistic substitution models (Tajima & Nei, 1984; Nei & Jin, 1989; Bulmer, 1989) and for general Markov models by Susko (2003). Susko’s estimator requires an MSA and has a linear time complexity in the sequence length. Dessimoz & Gil (2008) have derived an adaptation for OPAs with similar complexity. This paper introduces a constant time estimator for general Markov models, applicable to OPAs and MSAs. To this end, a conjecture (from Nei & Jin, 1989; Bulmer, 1989) for a simple symmetric substitution model is proven and then extended to general models. In a simulation, the estimator is evaluated and compared to previously published ones.

Methods

Preliminaries

Denote $A = {\{x_{i}\}}_{i = 1}^{n}$ a pairwise alignment consisting of n homologous i.i.d. character-pairs x_i (e.g., nucleotides, amino acids, or codons, but no insertion–deletions). The likelihood of having the two sequences in A separated by an evolutionary distance d is (Felsenstein, 1981) (1) $L (A | d) = \prod_{i = 1}^{n} p (x_{i}, d),$ where p(x_i, d) is the probability of the character-pair x_i at d. The ML estimator of the true distance d_t is (2) $\hat{d} = \underset{d \geq 0}{arg max} L (A | d) .$ While for simple mechanistic substitution models the maximum can be expressed analytically, it is usually found numerically for empirical and complex mechanistic models using the Newton–Raphson method. Let I_n(d) denote the Fisher information for d, i.e., (3) $I_{n} (d) = - n E [\frac{\partial^{2}}{\partial d^{2}} log p (X, d)],$ where X is a random variable of which the x_i are realizations. The asymptotic variance of $\hat{d}$ is provided by standard theory (e.g., Pawitan, 2001) (4) $V (\hat{d}) = I_{n} {(d_{t})}^{- 1} .$ It can be estimated by evaluating the inverse of the Fisher information at $\hat{d}$ (hereafter ML variance) (5) $\hat{V} (\hat{d}) = I_{n} {(\hat{d})}^{- 1},$ or, alternatively, from A and $\hat{d}$ by a sample average: (6) ${\hat{V}}_{A} (\hat{d}) = - \frac{1}{n} {({\frac{1}{n} \sum_{i = 1}^{n} \frac{\partial^{2}}{\partial d^{2}} log p (x_{i}, d)|}_{d = \hat{d}})}^{- 1} = - {({\sum_{i = 1}^{n} \frac{\partial^{2}}{\partial d^{2}} log p (x_{i}, d)|}_{d = \hat{d}})}^{- 1} .$

Figure 1: Unrooted tree relating six sequences.
The labeled sequences {i, j, k, l} define a subtree of four sequences, a quartet. The δ’s indicate the branch lengths of the quartet, representing number of character changes. Under Markovian substitution models the ML estimates of the pairwise distances *d_ij* = *δ_i* + *δ_m* + *δ_j* and *d_kl* = *δ_k* + *δ_m* + *δ_l* covary because of the *δ_m* common mutation events.

Download full-size image

DOI: 10.7717/peerj.583/fig-1

Dessimoz & Gil (2008) distinguished three topological relations relevant for covariance estimation between any two pairwise distance estimates. First, the relation dependence, where two distances share some common evolution (e.g., d_ij and d_kl in Fig. 1). Second, the similar relation triplet, where two distances additionally share a sequence (e.g., d_ij and d_jk). Third, the case independence, where the distances are independent (e.g., d_ik and d_jl). Note that the second case can be conceptually reduced to the first (with j = l and δ_j = δ_l = 0). Further, we assume that mutation events on different edges of a tree are independent, thus, the distances in the independence-case have zero covariance. Therefore, our derivations will focus on the dependence-case without loss of generality.

General ML theory provides covariance estimates if all unknown parameters are estimated jointly. For instance, in ML tree reconstruction, the variance/covariance matrix can be estimated from the observed Fisher information matrix. However, the estimation of the pairwise distances considered here is done separately for each distance so that general ML theory can not be applied. Susko (2003) has derived an interesting estimator for the covariance of two distances ${\hat{d}}_{i j}$ and ${\hat{d}}_{k l}$ estimated from pairwise alignments which are induced by an MSA. It is a sample average based on the following expression (hereafter Susko-covariance) (7) $cov ({\hat{d}}_{i j}, {\hat{d}}_{k l}) = n V_{i j} V_{k l} E [{\frac{\partial}{\partial d} log p_{i j} (X, d)|}_{d = d_{t, i j}} \cdot {\frac{\partial}{\partial d} log p_{k l} (X, d)|}_{d = d_{t, k l}}] .$ Here, d_t,ij and d_t,kl are the true distances, the random-variable X stands for quartets of homologous characters (as opposed to pairs in Eq. (3)), and p_ij(X, d) denotes the probability for the pair {i, j} in X at the distance d. The Susko-covariance has two limitations. First, it requires an MSA, i.e., it is not applicable to distances derived from OPAs (for a discussion see Dessimoz & Gil, 2008). Second, the O(n) computation time may become prohibitive in large scale studies. Alternatively, a nonparametric bootstrap can be used (Efron & Tibshirani, 1993), but it takes substantially longer computation times and requires an MSA too.

Previously, two estimators have been presented do tackle the limitations. They work with both MSAs and OPAs. The first method is a numerical approximation to the variance of the difference between two distances involving a common sequence. It runs in constant time with respect to the sequence-length and takes the following form (Dessimoz et al., 2006) (8) ${\hat{V}}_{N U M} ({\hat{d}}_{i j} - {\hat{d}}_{j k}) = \frac{{\hat{d}}_{i k}^{ϕ_{1}}}{\hat{V} {({\hat{d}}_{i k})}^{ϕ_{2}}} \cdot \frac{{[\hat{V} ({\hat{d}}_{i j}) + \hat{V} ({\hat{d}}_{j k})]}^{ϕ_{3}} {[\hat{V} ({\hat{d}}_{i j}) \hat{V} ({\hat{d}}_{j k}))]}^{ϕ_{4}}}{{({\hat{d}}_{i j} + {\hat{d}}_{j k})}^{ϕ_{5}}},$ where the ϕ_i are parameters optimised a priori for a particular substitution model. This leads (through V(X−Y) = V(X) + V(Y)−2cov(X, Y)) to a fast covariance estimator for the triplet-case (hereafter referred to as triplet-covariance) (9) $cov ({\hat{d}}_{i j}, {\hat{d}}_{j k}) = \frac{1}{2} [\hat{V} ({\hat{d}}_{i j}) + \hat{V} ({\hat{d}}_{j k}) - {\hat{V}}_{N U M} ({\hat{d}}_{i j} - {\hat{d}}_{j k})] .$ The second method is based on Susko’s theory and shares the linear time complexity (Dessimoz & Gil, 2008, hereafter anchor-covariance). It was specifically designed to bypass the problem of inconsistent homology inference between OPAs using the concept of anchors—a globally consistent subset of aligned character pairs. In this paper, we propose a fast and general approach we term branch-covariance. It is motivated by an analytic result obtained under the simple r-state symmetric model.

r-state symmetric model

To obtain analytic results we will work with the r-state symmetric model, also know as the N_r model (Neyman, 1971). It is a generalization of the Jukes–Cantor model (Jukes & Cantor, 1969), which has four states (r = 4), to r character-states. The N_r model assumes a uniform distribution of states at the root, and equal rates of transitions between any two distinct character states. The probability to observe a mutation after time t is (10) $p_{m} (t) = β (1 - e^{- \frac{α t}{β}}), β = \frac{r - 1}{r},$ where α is the total rate of substitution. Thus, if two sequences are separated by t, the distance between them will be d = αt.

Because of the symmetries in the model, the number of differing sites I in a given pairwise alignment of length n is a sufficient statistics for the pairwise ML distance (11) $\hat{d} = - β ln (1 - \frac{I}{n β}) .$

An estimator for the variance of $\hat{d}$ can be obtained by applying Eq. (4) derived from the likelihood function. Alternatively, for models estimating distances from proportions of differing sites, the variance can be approximated by the Delta technique. This has been done by Kimura & Ohta (1972) for r = 4 and generalized by Tajima & Nei (1984) to (12) $\hat{V} (\hat{d}) = β [(1 - β) e^{\frac{2 d}{β}} + (2 β - 1) e^{\frac{d}{β}} - β] / n .$

We are going to use the Delta technique to derive an estimator for the covariance of two N_r ML distances d_ij = δ_i + δ_m + δ_j and d_kl = δ_k + δ_m + δ_l in the dependence-case (Fig. 1). Nei & Jin (1989) used an informal argument to propose the following expression with β = 3/4: (13) $cov ({\hat{d}}_{i j}, {\hat{d}}_{k l}) = β [(1 - β) e^{\frac{2 δ_{m}}{β}} + (2 β - 1) e^{\frac{δ_{m}}{β}} - β] / n .$ The equation originates from the assumption, that the covariance of two distance estimates with an underlying shared path length δ_m, is formally equivalent to the variance (Eq. (12)) of an ML estimate of a pairwise distance δ_m. Indeed, Bulmer (1989) presented a proof for β = 3/4 for the triplet-case and conjectured that Eq. (13) with j = l and δ_j = δ_l = 0 is true for any β. To the best of our knowledge Eq. (13) has not been proven yet in its most general form, i.e., for the dependence-case and any β.

We will first compute (14) $cov (I_{i j}, I_{k l}) = n cov (S_{i j}, S_{k l}) = n E [S_{i j} S_{k l}] - n E [S_{i j}] E [S_{k l}],$ where S_ij is a random variable indicating whether sequences i and j are identical (S_ij = 0) or different (S_ij = 1) at a particular site. Subsequently, we will apply the Delta method to obtain Eq. (13) from cov(I_ij, I_kl). We start by noting that (15) $E [S_{i j}] = Pr (S_{i j} = 1) = p_{m} (d_{i j}) .$ Therefore, the problem reduces to computing (16) $E [S_{i j} S_{k l}] = Pr (S_{i j} = 1 \land S_{k l} = 1) .$ In the following we will represent the quartet from Fig. 1 by the symbol , where the terminal node of the upper left branch corresponds to i. Furthermore, we are going to mark a branch with ○ if a particular site in the evolving sequence is in a different state at the endpoints of the branch. As an example, we look at the pattern . Here, the site in question changed its state on the branches leading to {j, k, l} but did not change its state on the branch leading to i and on the middle branch. A particular labeling of the nodes with characters from the alphabet of N_r for the given pattern has probability

For this pattern, there are r(r−1)³ possible labellings, of which only r(r−1)²(r−2) satisfy S_ij = 1∧S_kl = 1. Symmetrical mutation patterns, like for example have the same number of labelings leading to S_ij = 1∧S_kl = 1 but different mutation probabilities. We consider now all the patterns (grouped by symmetry) and corresponding labelings for the desired condition:

Using Maple (script in Supplemental Materials) we find that the expression simplifies to (19) $E [S_{i j} S_{k l}] = β (1 - β e^{- \frac{d_{i j}}{β}} - β e^{- \frac{d_{k l}}{β}} + (1 - β) e^{- \frac{d_{i j} + d_{k l} - 2 δ_{m}}{β}} + (2 β - 1) e^{- \frac{d_{i j} + d_{k l} - δ_{m}}{β}}) .$ Plugging Eqs. (19) and (15) in (14) we obtain (20) $cov (I_{i j}, I_{k l}) = n β ((1 - β) e^{- \frac{d_{i j} + d_{k l} - 2 δ_{m}}{β}} + (2 β - 1) e^{- \frac{d_{i j} + d_{k l} - δ_{m}}{β}} - β e^{- \frac{d_{i j} + d_{k l}}{β}}) .$ We turn to the Delta method. The function ${\hat{d}}_{i j} (I_{i j})$ can be approximated by a first-order Taylor series around E[I_ij] = np_m(d_ij) (21) ${\hat{d}}_{i j}^{⋆} (I_{i j}) = {\hat{d}}_{i j} (E [I_{i j}]) + {\hat{d}}_{i j}^{'} (E [I_{i j}]) (I_{i j} - E [I_{i j}]),$ where (22) ${\hat{d}}_{i j}^{'} (E [I_{i j}]) = {\frac{\partial {\hat{d}}_{i j} (I_{i j})}{\partial I_{i j}}|}_{I_{i j} = E [I_{i j}]} = {{(n - \frac{I_{i j}}{β})}^{- 1}|}_{I_{i j} = E [I_{i j}]} = \frac{e^{\frac{d_{i j}}{β}}}{n} .$ The covariance of ${\hat{d}}_{i j}$ and ${\hat{d}}_{k l}$ is asymptotically equal to the covariance of ${\hat{d}}_{i j}^{⋆}$ and ${\hat{d}}_{k l}^{⋆}$ (23) $cov ({\hat{d}}_{i j} (I_{i j}), {\hat{d}}_{k l} (I_{k l})) \sim cov ({\hat{d}}_{i j}^{⋆} (I_{i j}), {\hat{d}}_{k l}^{⋆} (I_{k l})))$ (24) $= {\hat{d}}_{i j}^{'} (E [I_{i j}]) {\hat{d}}_{k l}^{'} (E [I_{k l}]) cov (I_{i j}, I_{k l})$ (25) $= β [(1 - β) e^{\frac{2 δ_{m}}{β}} + (2 β - 1) e^{\frac{δ_{m}}{β}} - β] / n . □$ We provide a Maple program implementing all steps of the proof in Supplemental Materials.

Covariance under general Markov models

We have derived under N_r that the covariance of two distance estimates with an underlying shared path length δ_m is asymptotically equal to the variance of an ML estimate of a true pairwise distance δ_m, i.e., (26) $cov ({\hat{d}}_{i j}, {\hat{d}}_{k l}) \sim V (\hat{d} | d_{t} = δ_{m}) .$ We conjecture now that the relationship holds for general substitution models and apply it to derive a covariance estimator. To this end, we first discuss how the covariance can be written in terms of a rate matrix Q, an equilibrium frequency vector π (which together fully specify a substitution model), and δ_m. In the next section, we will show how δ_m can be estimated from the input distances by the method of weighted least squares (WLS).

According to the conjecture and Eq. (4) we express the desired covariance by (27) $cov ({\hat{d}}_{i j}, {\hat{d}}_{k l}) = - \frac{1}{n} E {[{\frac{\partial^{2}}{\partial d^{2}} log p (X, d)|}_{d = δ_{m}}]}^{- 1} .$ The expected value expression can be written as (28) $\sum_{\forall (u, v)} {[p (x_{u v}, d) \frac{\partial^{2}}{\partial d^{2}} log p (x_{u v}, d)]}_{d = δ_{m}} = \sum_{\forall (u, v)} {[\frac{\partial^{2}}{\partial d^{2}} p (x_{u v}, d) - \frac{{(\frac{\partial}{\partial d} p (x_{u v}, d))}^{2}}{p (x_{u v}, d)}]}_{d = δ_{m}},$ where the summation goes over all possible character pairs. In terms of a rate matrix Q and an equilibrium frequency vector π the probability of a character pair (k = 0) and the derivatives (k > 0) of the probability with respect to the distance are (29) $\frac{\partial^{(k)}}{\partial d^{(k)}} p (x_{u v}, d) = π_{u} {[Q^{k} e^{Q d}]}_{u v} .$ Plugging Eqs. (28) and (29) in (27) we obtain (30) $cov ({\hat{d}}_{i j}, {\hat{d}}_{k l}) = - \frac{1}{n} {[\sum_{\forall (u, v)} π_{u} ({[Q^{2} e^{Q δ_{m}}]}_{u v} - \frac{{[Q e^{Q δ_{m}}]}_{u v}^{2}}{{[e^{Q δ_{m}}]}_{u v}})]}^{- 1} .$ A plot of this expression as a function of δ_m can be found in Fig. S1. A covariance estimator is obtained by substituting δ_m in Eq. (30) by its WLS estimate, which we derive in the next section. To save computation time, we can discretise the distance space in the relevant range to some desired level of accuracy, precompute the expected value expressions, and store them in a hash table.

Topological relation and path length

Equation (30) expresses the covariance in the dependence case as a function of the shared path length δ_m. To obtain a covariance estimator, we determine first whether the two distances in question are dependent and, in case they are, estimate δ_m. We will do that by WLS using the six pairwise distance estimates $\{{\hat{d}}_{u v}\}$ between the four sequences i, j, k, l and their variances {v_uv}. The sequences can be related by three topological configurations: $T_{1} : ((i, k), (j, l)), T_{2} : ((i, l), (j, k)), T_{3} : ((i, j), (k, l)),$ where T₁, T₂ map to the dependence case and T₃ corresponds to the independence case. An argument set out in Supplemental Materials shows that the weighted sum of squares (S) for each of the topologies can be expressed in a simple form which is fast to compute: (31) $\begin{array}{l} S (T_{1}) = \frac{{\hat{d}}_{i j} + {\hat{d}}_{k l} - {\hat{d}}_{i l} - {\hat{d}}_{j k}}{v_{i j} + v_{k l} + v_{i l} + v_{j k}}, S (T_{2}) = \frac{{\hat{d}}_{i j} + {\hat{d}}_{k l} - {\hat{d}}_{i k} - {\hat{d}}_{j l}}{v_{i j} + v_{k l} + v_{i k} + v_{j l}}, \\ S (T_{3}) = \frac{{\hat{d}}_{i k} + {\hat{d}}_{j l} - {\hat{d}}_{i l} - {\hat{d}}_{j k}}{v_{i k} + v_{j l} + v_{i l} + v_{j k}} . \end{array}$ The best fitting topology is determined by argmin_{T_i}{S(T_i)}. If this results in T₃ the desired covariance is zero, otherwise we need to estimate δ_m. The WLS estimates are (see Supplemental Materials): (32) $2 {\hat{δ}}_{m} (T_{1}) = \frac{({\hat{d}}_{i j} + {\hat{d}}_{k l}) (v_{i l} + v_{j k}) + ({\hat{d}}_{i l} + {\hat{d}}_{j k}) (v_{i j} + v_{k l})}{v_{i j} + v_{k l} + v_{i l} + v_{j k}} - ({\hat{d}}_{i k} + {\hat{d}}_{j l}),$ (33) $2 {\hat{δ}}_{m} (T_{2}) = \frac{({\hat{d}}_{i j} + {\hat{d}}_{k l}) (v_{i k} + v_{j l}) + ({\hat{d}}_{i k} + {\hat{d}}_{j l}) (v_{i j} + v_{k l})}{v_{i j} + v_{k l} + v_{i k} + v_{j l}} - ({\hat{d}}_{i l} + {\hat{d}}_{j k}) .$ Since these are the estimators for unconstrained WLS they can result in negative values, in which case we estimate the covariance to be zero.

Simulation settings

The performance of the various covariance estimators was evaluated with the same simulation approach as in a previous study (Dessimoz & Gil, 2008). Specifically, 100 random quartets were sampled from a tree of life on 352 species. The tree was inferred by the LeastSquaresTree function (Gil & Gonnet, 2009) included in the Darwin package (Gonnet et al., 2000) using pairwise distance and variance data from the OMA project (Dessimoz et al., 2005). A uniformly distributed U(0.5, 2) expansion/contraction factor was applied on each quartet to also explore extremer regions of the branch-length space, while preserving the relative branch-length structure of the original tree.

For each dilated model quartet 10,000 times three random amino-acid sequences of length m = {200, 500, 800} were generated and mutated along the quartet assuming the GCB substitution model (Gonnet, Cohen & Benner, 1992). The entire simulation procedure was run twice, once without any insertion–deletions to produce ungapped alignments to test the methods under the true models (i.e., without the effect of alignment errors), and once introducing gaps of Zipfian distributed length (Benner, Cohen & Gonnet, 1993). The gapped sequences were aligned by global pairwise dynamic programming with the Align function from Darwin to obtain OPAs. ML pairwise distances were estimated with the EstimatePam function. The sample variance and covariance over the 10,000 samples (hereafter Monte Carlo-variance and Monte Carlo-covariance) served as a reference values, as they are unbiased estimators of the true values.

To test the asymptotic conjecture (Eq. (26)) the ungapped simulation was repeated with very long sequences (10,000 amino-acids). Then, the length δ_m of the middle branch of each model quartet was extracted. Subsequently, 10,000 random sequences were generated, and mutated each with a distance δ_m according to the GCB model. Finally, the ML distance and ML variance between each resulting pair of sequences, and the Monte Carlo-variance were computed.

Results and Discussion

Evaluation of basic components of branch-covariance

We have tested the validity of the conjecture in Eq. (26), which was derived under the N_r model, and the accuracy of the ML variance (Eq. (5)) in a simulation under the GCB model with long sequences (10,000 amino-acids).

Figure 2: Components of branch-covariance for sequence-length 10,000 amino-acids.
Error-bars indicate 95% confidence intervals. (A) Monte Carlo covariance versus Monte Carlo variance shows that the relationship derived under the *N_r* model also holds for the empirical substitution model tested here. (B) Monte Carlo variance versus mean of the ML variance shows that the ML variance is negatively biased.

Download full-size image

DOI: 10.7717/peerj.583/fig-2

A plot of the Monte Carlo-variance of the ML estimate of a pairwise distance δ_m versus the Monte Carlo-covariance between the ML estimates of two pairwise distances in the dependence case with a shared path of length δ_m corroborates the conjecture and suggests that the result is valid in general (Fig 2A). The branch-covariance relies on the ML variance to approximate the true variance. A comparison with the Monte Carlo-variance shows that it underestimates the variance for large samples and with correct alignments (Fig 2B). Therefore, we expected the branch-covariance to inherit the negative bias.

Evaluation of estimators

To evaluate and compare the various estimators under ideal conditions we have carried out two types of simulations: one without introducing indel events (ungapped simulation), and one with indel events (gapped simulation).

The gapped simulation did not require explicit alignment; the MSA is trivial without gaps. This situation corresponds to the model assumed by the Susko-, branch-, and triplet-covariance. Note that in practice with real data, there are usually gaps in an MSA estimate. In this case, the distance and covariance estimators are applied assuming the MSA to be correct. Gaps are either excluded from the analysis, or treated as unknown characters.

The ungapped simulation required alignment which was carried out by OPA. Here, the Susko-covariance can not be applied; it requires an MSA. The setting corresponds to the model assumed by the anchor-covariance, which is an adaptation of the Susko-covariance, specifically designed to bypass the problem of inconsistent homology inference between OPAs. The triplet- and branch-covariance are applicable to OPAs, as these estimators operate directly on the distances and not on the alignments (but the assumption of a gap-free and correct underlying alignment is violated).

Ungapped simulation

This section presents the performance of the branch-covariance under the true model and compares it with the Susko- and triplet-covariance (Fig. 3). To this end, the three topological cases—dependence, independence, and triplet—are treated separately.

In the dependence case the Susko-covariance is unbiased. The branch-covariance lies in most cases within the 95% confidence interval of the Monte Carlo covariance; when it lies outside then it underestimates. The negative bias confirms the prediction from the previous section. In the independence case, where the true covariance is zero, both estimators have a positive bias of comparable magnitude, though the branch-covariance appears to have a lower bias with increasing sequence length. For the anchor-covariance the negative bias is no surprise; it returns by construction non-negative values. For the triplets, as in the dependence case, Susko’s estimator appears to be unbiased and the branch-covariance shows a minor negative bias. The triplet-covariance has a positive bias of comparable magnitude to the one of the branch-covariance. Although the lack of bias is an attractive feature of an estimator, it does not guarantee a low total error, the sum of systematic and random error. Therefore, it is instructive to look at the mean square error (MSE). Indeed, the branch-covariance has a lower standard deviation and MSE than the Susko-covariance under all three topological relations (Fig. S2, Table 1).

Table 1:

Ungapped simulation.

Average MSE, median MSE (italic), and average of ratios between MSE of branch-covariance and MSE of Susko-covariance (bold). A ratio of 0.16, for instance, indicates that the MSE of the branch-covariance was—on average—16% of the MSE of the Susko-covariance. Dependence (D), triplet (T) and independence (I) case for sequence lengths 200, 500 and 800 amino-acids.

		200			500			800
		D	T	I	D	T	I	D	T	I
Branch	Avg	39.89	255.86	2.37	3.13	16.95	0.12	0.71	4.73	0.03
	Median	17.94	72.63	0.45	1.29	5.37	0.01	0.33	1.35	0.00
Susko	Avg	215.96	420.72	125.59	12.74	22.47	8.06	2.76	5.19	1.87
	Median	112.93	121.81	61.47	6.67	7.65	3.54	1.56	2.00	0.79
	Avg ratio	0.16	0.80	0.07	0.17	0.70	0.01	0.27	0.78	0.02

DOI: 10.7717/peerj.583/table-1

Gapped simulation

The branch-, anchor-, and triplet-covariance have been tested on distances derived from OPAs.

The branch-covariance has a lower bias than the anchor-covariance for all three topological relations (Fig. 4). In the dependence case the differences are minor, though the branch-covariance has a consistently higher correlation with the Monte Carlo covariance. A big difference is visible for the two other topological relations (independence and triplet), where the anchor-covariance’s bias is up to twice the branch-covariance’s. The superiority of the branch-covariance is also reflected by the MSEs (Table 2). The triplet-covariance has a greater bias than the branch-covariance for sequences of length 200; for length 500 the two estimators have quantitatively a similar bias, but in opposite directions; and for length 800 the triplet-covariance has clearly a smaller systematic error.

Table 2:

Gapped simulation.

Average MSE, median MSE (italic), and average of ratios between MSE of branch-covariance and MSE of anchor-covariance (bold). A ratio of 0.22, for instance, indicates that the MSE of the branch-covariance was—on average—22% of the MSE of the anchor-covariance. Dependence (D), triplet (T) and independence (I) case for sequence lengths 200, 500 and 800 amino-acids.

		200			500			800
		D	T	I	D	T	I	D	T	I
Branch	Avg	36.19	170.04	3.63	3.99	27.23	0.21	1.38	9.68	0.06
	Median	19.13	110.60	1.23	1.95	6.26	0.03	0.48	1.88	0.00
Anchor	Avg	159.07	242.56	95.79	11.60	95.62	7.68	3.19	42.87	2.09
	Median	102.13	170.27	59.02	6.66	12.30	3.51	1.67	4.55	0.92
	Avg ratio	0.22	0.67	0.06	0.22	0.36	0.06	0.20	0.34	0.04

DOI: 10.7717/peerj.583/table-2

Note that the anchor- and triplet-covariance have been evaluated under the same simulation conditions in a previous work (Dessimoz & Gil, 2008). They have been reproduced to evaluate the branch-covariance. The results on the triplet- and anchor-covariance reported here are in agreement with our previous results.

Conclusion

A fast and general method to estimate the covariance of pairwise ML distances estimates has been presented. The estimator is based on a conjecture (going back to Nei & Jin, 1989) which links the covariance to path lengths on the underlying phylogenetic tree. It has been proven here under a simple symmetric substitution model and formulated for general models. The estimator is applicable to distances estimated from parametric as well as empirical substitution models, and works with both MSAs and OPAs. A simulation has shown that the estimator has a lower total error than previously published estimators operating directly on the sequence data. Moreover, in contrast to these linear time complexity methods, the estimator runs in constant time in the sequence-length, provided that the pairwise distance information has been precomputed (as in the initial all-against-all pairwise comparison, customary to most pairwise distance approaches).

The evaluation under the correct model conducted here is an important baseline. However, ideal conditions are never met when working with real data, so that as future work the various estimators should be compared in situations where the model-assumptions are violated. Under such conditions, it is conceivable that estimators operating directly on the sequence data outperform the method presented here. The rationale being that they extract information from the data at hand—for instance, the actual frequencies of aligned character pairs, as opposed to the ones assumed in the substitution model—and by being more adaptive, they could be more robust to model violations.

Supplemental Information

Supplemental Material

(1) The MAPLE code supporting the proof in Section “r-state symmetric model”; (2) the MAPLE code deriving equations (31)–(33); (3) Figures S1 and S2.

DOI: 10.7717/peerj.583/supp-1

Download

ZIP file with simulation environment and simulation results

(1) A file Readme.txt describing the simulation environment and the format of the results; (2) the Darwin code for the simulation; (3) the raw results of the simulation.

DOI: 10.7717/peerj.583/supp-2

Download

[1] Benner SA, Cohen MA, Gonnet GH. 1993. Empirical and structural models for insertions and deletions in the divergent evolution of proteins. Journal of Molecular Biology 229(4):1065-1082

[2] Bulmer M. 1989. Estimating the variability of substitution rates. Genetics 123(3):615-619

[3] Dessimoz C, Cannarozzi G, Gil M, Margadant D, Roth A, Schneider A, Gonnet GH. 2005. OMA, a comprehensive, automated project for the identification of orthologs from complete genome data: introduction and first achievements. Lecture Notes in Computer Science 61-72

[4] Dessimoz C, Gil M. 2008. Covariance of maximum likelihood evolutionary distances between sequences aligned pairwise. BMC Evolutionary Biology 8(1):179

[5] Dessimoz C, Gil M, Schneider A, Gonnet GH. 2006. Fast estimation of the difference between two PAM/JTT evolutionary distances in triplets of homologous sequences. BMC Bioinformatics 7(529)

[6] Efron B, Tibshirani RJ. 1993. An introduction to the bootstrap. New York: Chapman & Hall.

[7] Elias I. 2006. Settling the intractability of multiple alignment. Journal of Computational Biology 13(7):1323-1339

[8] Felsenstein J. 1981. Evolutionary trees from gene frequencies and quantitative characters: finding maximum likelihood estimates. Evolution 35:1229-1242

[9] Gascuel O. 1997. BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Molecular Biology and Evolution 14(7):685-695

[10] Gil M, Gonnet GH. 2009. Phylogenetic tree building methods. In: Appel RD, Feytmans E, eds. Bioinformatics: a swiss perspective. Singapore: World Scientific.

[11] Gil M, Zanetti MS, Zoller S, Anisimova M. 2013. Codonphyml: fast maximum likelihood phylogeny estimation under codon substitution models. Molecular Biology and Evolution 30(6):1270-1280

[12] Gonnet GH, Cohen MA, Benner SA. 1992. Exhaustive matching of the entire protein sequence database. Science 256(5003):1443-1445

[13] Gonnet GH, Hallett MT, Korostensky C, Bernardin L. 2000. Darwin v. 2.0: an interpreted computer language for the biosciences. Bioinformatics 16(2):101-103

[14] Guindon S, Dufayard J, Lefort V, Anisimova M, Hordijk W, Gascuel O. 2010. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Systematic Biology 59(3):307-321

[15] Jukes T, Cantor C. 1969. Evolution of protein molecules. In: Munro H, ed. Mammalian protein metabolism III. New York: Academic Press. 21-132

[16] Just W. 2001. Computational complexity of multiple sequence alignment with sp-score. Journal of Computational Biology 8(6):615-623

[17] Katoh K, Kuma K, Toh H, Miyata T. 2005. Mafft version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Research 33(2):511-518

[18] Kimura M, Ohta T. 1972. On the stochastic model for estimation of mutational distance between homologous proteins. Journal of Molecular Evolution 2(1):87-90

[19] Löytynoja A, Goldman N. 2008. Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science 320(5883):1632-1635

[20] Mihaescu R, Pachter L. 2008. Combinatorics of least-squares trees. Proceedings of the National Academy of Sciences of the United States of America 105(36):13206-13211

[21] Needleman SB, Wunsch CD. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48:443-453

[22] Nei M, Jin L. 1989. Variances of the average numbers of nucleotide substitutions within and between populations. Molecular Biology and Evolution 6(3):290-300

[23] Neyman J. 1971. Molecular studies of evolution: a source of novel statistical problems. In: Gupta SS, ed. Statistical decision theory and related topics. New York: Academic Press. 1-27

[24] Pawitan Y. 2001. In all likelihood: statistical modelling and inference using likelihood. New York: Oxford University Press.

[25] Roch S. 2010. Toward extracting all phylogenetic information from matrices of evolutionary distances. Science 327(5971):1376-1379

[26] Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30(9):1312-1313

[27] Steel M, Hendy M, Penny D. 1988. Loss of information in genetic distances. Nature 336(6195):118