Phylogenetic Estimation of Context-Dependent Substitution Rates by Maximum Likelihood
Open Access
- 5 December 2003
- journal article
- research article
- Published by Oxford University Press (OUP) in Molecular Biology and Evolution
- Vol. 21 (3) , 468-488
- https://doi.org/10.1093/molbev/msh039
Abstract
Nucleotide substitution in both coding and noncoding regions is context-dependent, in the sense that substitution rates depend on the identity of neighboring bases. Context-dependent substitution has been modeled in the case of two sequences and an unrooted phylogenetic tree, but it has only been accommodated in limited ways with more general phylogenies. In this article, extensions are presented to standard phylogenetic models that allow for better handling of context-dependent substitution, yet still permit exact inference at reasonable computational cost. The new models improve goodness of fit substantially for both coding and noncoding data. Considering context dependence leads to much larger improvements than does using a richer substitution model or allowing for rate variation across sites, under the assumption of site independence. The observed improvements appear to derive from three separate properties of the models: their explicit characterization of context-dependent substitution within N-tuples of adjacent sites, their ability to accommodate overlapping N-tuples, and their rich parameterization of the substitution process. Parameter estimation is accomplished using an expectation maximization algorithm, with a quasi-Newton algorithm for the maximization step; this approach is shown to be preferable to ordinary Newton methods for parameter-rich models. Overlapping tuples are efficiently handled by assuming Markov dependence of the observed bases at each site on those at the N − 1 preceding sites, and the required conditional probabilities are computed with an extension of Felsenstein's algorithm. Estimated substitution rates based on a data set of about 160,000 noncoding sites in mammalian genomes indicate a pronounced CpG effect, but they also suggest a complex overall pattern of context-dependent substitution, comprising a variety of subtle effects. 
Estimates based on about 3 million sites in coding regions demonstrate that amino acid substitution rates can be learned at the nucleotide level, and suggest that context effects across codon boundaries are significant.
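The abstract refers to an extension of Felsenstein's algorithm for computing the required conditional probabilities. As background only, the following is a minimal sketch of the standard single-site version of that algorithm (often called the pruning algorithm), not of the article's N-tuple extension. The Jukes-Cantor substitution model, the nested-list tree encoding, and all function names here are assumptions chosen for illustration; none of them come from the article.

```python
import math

BASES = "ACGT"


def jc69_prob(a, b, t):
    """Jukes-Cantor probability that base a is replaced by base b over branch length t.

    Assumption: JC69 is used only because it has a simple closed form.
    """
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def partial_likelihoods(node):
    """Return {base: P(observed leaves below node | node carries base)}.

    Hypothetical tree encoding: a leaf is a one-character string holding the
    observed base; an internal node is a list of (child, branch_length) pairs.
    """
    if isinstance(node, str):
        # A leaf is compatible only with the base actually observed there.
        return {b: 1.0 if b == node else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, t in node:
        child_l = partial_likelihoods(child)
        for a in BASES:
            # Sum over the unknown base at the child, then multiply across children.
            result[a] *= sum(jc69_prob(a, b, t) * child_l[b] for b in BASES)
    return result


def site_likelihood(tree):
    """Likelihood of one alignment column, with uniform base frequencies at the root."""
    root = partial_likelihoods(tree)
    return sum(0.25 * root[b] for b in BASES)


# Example column: two leaves observing A grouped together, plus a leaf observing G.
example_tree = [([("A", 0.1), ("A", 0.1)], 0.05), ("G", 0.2)]
print(site_likelihood(example_tree))
```

The recursion visits each node once, so the cost per site grows linearly with the number of leaves. Treating each N-tuple of adjacent bases as a single symbol from a larger alphabet is one way to picture how the same recursion could be carried over to context-dependent models, though the article itself should be consulted for the actual construction.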