Characterizing the Phylogenetic Tree-Search Problem
- 1 March 2012
- journal article
- Published by Oxford University Press (OUP) in Systematic Biology
- Vol. 61 (2) , 228-39
- https://doi.org/10.1093/sysbio/syr097
Abstract
Phylogenetic trees are important in many areas of biological research, ranging from systematic studies to the methods used for genome annotation. Finding the best scoring tree under any optimality criterion is an NP-hard problem, which necessitates the use of heuristics for tree-search. Although tree-search plays a major role in obtaining a tree estimate, there remains a limited understanding of its characteristics and how the elements of the statistical inferential procedure interact with the algorithms used. This study begins to answer some of these questions through a detailed examination of maximum likelihood tree-search on a wide range of real genome-scale data sets. We examine all 10,395 trees for each of the 106 genes of an eight-taxa yeast phylogenomic data set, then apply different tree-search algorithms to investigate their performance. We extend our findings by examining two larger genome-scale data sets and a large disparate data set that has been previously used to benchmark the performance of tree-search programs. We identify several broad trends occurring during tree-search that provide an insight into the performance of heuristics and may, in the future, aid their development. These trends include a tendency for the true maximum likelihood (best) tree to also be the shortest tree in terms of branch lengths, a weak tendency for tree-search to recover the best tree, and a tendency for tree-search to encounter fewer local optima in genes that have a high information content. When examining current heuristics for tree-search, we find that nearest-neighbor-interchange performs poorly, and frequently finds trees that are significantly different from the best tree. In contrast, subtree-pruning-and-regrafting tends to perform well, nearly always finding trees that are not significantly different to the best tree. Finally, we demonstrate that the precise implementation of a tree-search strategy, including when and where parameters are optimized, can change the character of tree-search, and that good strategies for tree-search may combine existing tree-search programs.Keywords
This publication has 31 references indexed in Scilit:
- New Algorithms and Methods to Estimate Maximum-Likelihood Phylogenies: Assessing the Performance of PhyML 3.0Systematic Biology, 2010
- Phylogenomics and the reconstruction of the tree of lifeNature Reviews Genetics, 2005
- Maximum Likelihood of Evolutionary Trees Is HardPublished by Springer Nature ,2005
- A Simple, Fast, and Accurate Algorithm to Estimate Large Phylogenies by Maximum LikelihoodSystematic Biology, 2003
- PAL: an object-oriented programming library for molecular evolution and phylogeneticsBioinformatics, 2001
- On computing the nearest neighbor interchange distancePublished by American Mathematical Society (AMS) ,2000
- Predicting the Evolution of Human Influenza AScience, 1999
- On the Linear-Cost Subtree-Transfer Distance between Phylogenetic TreesAlgorithmica, 1999
- Evidence for a clade of nematodes, arthropods and other moulting animalsNature, 1997
- A new look at the statistical model identificationIEEE Transactions on Automatic Control, 1974