Impact of Taxon Sampling on the Estimation of Rates of Evolution at Sites
Open Access
- 8 December 2004
- journal article
- research article
- Published by Oxford University Press (OUP) in Molecular Biology and Evolution
- Vol. 22 (3) , 784-791
- https://doi.org/10.1093/molbev/msi065
Abstract
The function of individual sites within a protein influences their rate of accepted point mutation. During the computation of phylogenetic likelihoods, rate heterogeneity can be modeled on a site-per-site basis with relative rates drawn from a discretized Γ-distribution. Site-rate estimates (e.g., the rate of highest posterior probability given the data at a site) can then be used as a measure of evolutionary constraints imposed by function. However, if the sequence availability is limited, the estimation of rates is subject to sampling error. This article presents a simulation study that evaluates the robustness of evolutionary site-rate estimates for both small and phylogenetically unbalanced samples. The sampling error on rate estimates was first evaluated for alignments that included 5–45 sequences, sampled by jackknifing, from a master alignment containing 968 sequences. We observed that the potentially enhanced resolution among site rates due to the inclusion of a larger number of rate categories is negated by the difficulty in correctly estimating intermediate rates. This effect is marked for data sets with fewer than 30 sequences. Although the computation of likelihood theoretically accounts for phylogenetic distances through branch lengths, the introduction of a single long-branch outlier sequence had a significant negative effect on site-rate estimates. Finally, the presence of a shift in rates of evolution between related lineages can be diagnostic of a gain/loss of function within a protein family. Our analyses indicate that detecting these rate shifts is a harder problem than estimating rates. This is so, partially, because the difference in rates depends on two rate estimates, each with an intrinsic uncertainty. The performances of four methods to detect these site-rate shifts are evaluated and compared. Guidelines are suggested for preparing data sets minimally influenced by error introduced by sequence sampling.
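As an illustration of the site-rate estimate mentioned in the abstract (the rate of highest posterior probability at a site), the following minimal sketch shows how such an estimate could be picked once per-category site likelihoods are available. It is not the authors' implementation: the function names, the four category rates, and the log-likelihood values are hypothetical, and the categories are assumed to be equally probable a priori, as is common for a discretized Γ-distribution.

```python
import math

def posterior_over_rate_categories(site_log_likelihoods, category_priors=None):
    """Posterior probability of each rate category for one site.

    site_log_likelihoods[k] is the log-likelihood of the site's data
    computed with the tree's branch lengths scaled by rate category k.
    Priors default to equal weights (an assumption of this sketch).
    """
    k = len(site_log_likelihoods)
    if category_priors is None:
        category_priors = [1.0 / k] * k
    log_joint = [ll + math.log(p)
                 for ll, p in zip(site_log_likelihoods, category_priors)]
    # Subtract the maximum before exponentiating to avoid underflow.
    m = max(log_joint)
    weights = [math.exp(v - m) for v in log_joint]
    total = sum(weights)
    return [w / total for w in weights]

def map_rate(category_rates, posteriors):
    """Return the category rate with the highest posterior probability."""
    best = max(range(len(posteriors)), key=posteriors.__getitem__)
    return category_rates[best]

# Hypothetical values for a single site and four rate categories.
example_rates = [0.1, 0.5, 1.0, 2.4]           # relative rates, mean 1.0
example_log_liks = [-12.0, -10.5, -10.0, -11.5]

example_post = posterior_over_rate_categories(example_log_liks)
example_estimate = map_rate(example_rates, example_post)
```

With few sequences, the per-category likelihoods at a site tend to be close to one another, so the highest-posterior category can change when the sample changes; this is one way to picture the sampling error the study evaluates.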