Correcting the Bias of Empirical Frequency Parameter Estimators in Codon Models
Open Access
- 30 July 2010
- journal article
- research article
- Published by Public Library of Science (PLoS) in PLOS ONE
- Vol. 5 (7) , e11230
- https://doi.org/10.1371/journal.pone.0011230
Abstract
Markov models of codon substitution are powerful inferential tools for studying biological processes such as natural selection and preferences in amino acid substitution. The equilibrium character distributions of these models are almost always estimated using nucleotide frequencies observed in a sequence alignment, primarily as a matter of historical convention. In this note, we demonstrate that a popular class of such estimators is biased, and that this bias has an adverse effect on goodness of fit and estimates of substitution rates. We propose a “corrected” empirical estimator that begins with observed nucleotide counts, but accounts for the nucleotide composition of stop codons. We show via simulation that the corrected estimates outperform the de facto standard estimates not just by providing better estimates of the frequencies themselves, but also by leading to improved estimation of other parameters in the evolutionary models. On a curated collection of sequence alignments, our estimators show a significant improvement in goodness of fit compared to the de facto standard approach. Maximum likelihood estimation of the frequency parameters appears to be warranted in many cases, albeit at a greater computational cost. Our results demonstrate that there is little justification, either statistical or computational, for continued use of the de facto standard estimators.
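The source of the bias described in the abstract can be illustrated with a minimal sketch. It assumes that the standard estimators build each codon frequency as a product of position-specific nucleotide frequencies, drop the stop codons, and renormalize. The abstract does not spell this out, so the function names, the universal-code stop codons, and the input frequencies below are all hypothetical, and this is not the authors' implementation.

```python
from itertools import product

NUCS = "ACGT"
# Assumption: universal genetic code stop codons.
STOPS = {"TAA", "TAG", "TGA"}


def product_codon_freqs(pos_freqs):
    """Codon frequencies as products of position-specific nucleotide
    frequencies, with stop codons removed and the rest renormalized."""
    raw = {}
    for codon in map("".join, product(NUCS, repeat=3)):
        if codon in STOPS:
            continue
        p = 1.0
        for k, nuc in enumerate(codon):
            p *= pos_freqs[k][nuc]
        raw[codon] = p
    total = sum(raw.values())
    return {codon: p / total for codon, p in raw.items()}


def implied_position_freqs(codon_freqs):
    """Nucleotide frequencies at each codon position implied by a
    distribution over sense codons."""
    out = [dict.fromkeys(NUCS, 0.0) for _ in range(3)]
    for codon, p in codon_freqs.items():
        for k, nuc in enumerate(codon):
            out[k][nuc] += p
    return out


# Hypothetical position-specific frequencies, for illustration only.
observed = [
    {"A": 0.3, "C": 0.2, "G": 0.3, "T": 0.2},
    {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
]

codon_freqs = product_codon_freqs(observed)
implied = implied_position_freqs(codon_freqs)

# Every stop codon starts with T, so discarding them lowers the implied
# frequency of T at the first position below the value that was plugged in.
print(observed[0]["T"], implied[0]["T"])
```

Under these assumptions, the nucleotide frequencies implied by the resulting codon distribution no longer match the frequencies that were supplied, because the discarded stop codons have a particular nucleotide composition. A corrected estimator in the spirit of the abstract would instead choose nucleotide parameters so that the implied frequencies reproduce the observed ones.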