Assessment of methods for amino acid matrix selection and their use on empirical data shows that ad hoc assumptions for choice of matrix are not justified

Open Access

24 March 2006

journal article
research article
Published by Springer Nature in BMC Ecology and Evolution

Vol. 6 (1), 29
https://doi.org/10.1186/1471-2148-6-29

Abstract

In recent years, model-based approaches such as maximum likelihood have become the methods of choice for constructing phylogenies. A number of authors have shown the importance of using adequate substitution models in order to produce accurate phylogenies. In the past, many empirical models of amino acid substitution have been derived using a variety of different methods and protein datasets. These matrices are normally used as surrogates, rather than deriving the maximum likelihood model from the dataset being examined. With few exceptions, selection between alternative matrices has been carried out in an ad hoc manner. We start by highlighting the potential dangers of arbitrarily choosing protein models by demonstrating an empirical example where a single alignment can produce two topologically different and strongly supported phylogenies using two different arbitrarily chosen amino acid substitution models. We demonstrate that in simple simulations, statistical methods of model selection are indeed robust and likely to be useful for protein model selection. We have investigated patterns of amino acid substitution among homologous sequences from the three domains of life and our results show that no single amino acid matrix is optimal for all of the datasets. Perhaps most interestingly, we demonstrate that for two large datasets derived from the proteobacteria and archaea, one of the most favored models in both datasets is a model that was originally derived from retroviral Pol proteins. This demonstrates that choosing protein models based on their source or method of construction may not be appropriate.
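
As an illustration of what statistical model selection between candidate matrices can look like, the following is a minimal sketch that ranks fitted models by the Akaike and Bayesian information criteria. The abstract does not say which criteria the authors used, so the choice of AIC and BIC here is an assumption, as is using the alignment length as the sample size for BIC. The model names, log-likelihoods, and parameter counts are hypothetical placeholders, not results from the paper.

```python
import math


def aic(log_likelihood, num_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * num_params - 2 * log_likelihood


def bic(log_likelihood, num_params, num_sites):
    """Bayesian information criterion: k ln(n) - 2 ln L (lower is better).

    Assumption: the number of alignment sites is used as the sample size n.
    """
    return num_params * math.log(num_sites) - 2 * log_likelihood


def rank_models(fits, num_sites):
    """Rank candidate models by AIC.

    fits maps a model name to (maximized log-likelihood, free parameter count).
    Returns a list of (name, aic, bic) tuples sorted by ascending AIC.
    """
    scored = [
        (name, aic(lnl, k), bic(lnl, k, num_sites))
        for name, (lnl, k) in fits.items()
    ]
    return sorted(scored, key=lambda row: row[1])


# Hypothetical fits for illustration only; in practice each log-likelihood
# would come from optimizing a tree under that matrix on the same alignment.
example_fits = {
    "matrix_A": (-10250.0, 25),
    "matrix_B": (-10210.0, 25),
    "matrix_B+G": (-10100.0, 26),
}

ranking = rank_models(example_fits, num_sites=300)
for name, aic_score, bic_score in ranking:
    print(f"{name}: AIC={aic_score:.1f} BIC={bic_score:.1f}")
```

The point of the sketch is that every candidate matrix is scored on the same alignment, and the ranking comes from the fit itself, not from the matrix's source or method of construction.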
