Characterizing the Phylogenetic Tree-Search Problem

1 March 2012

journal article
Published by Oxford University Press (OUP) in Systematic Biology

Vol. 61 (2) , 228-39
https://doi.org/10.1093/sysbio/syr097

Abstract

Phylogenetic trees are important in many areas of biological research, ranging from systematic studies to the methods used for genome annotation. Finding the best scoring tree under any optimality criterion is an NP-hard problem, which necessitates the use of heuristics for tree-search. Although tree-search plays a major role in obtaining a tree estimate, there remains a limited understanding of its characteristics and how the elements of the statistical inferential procedure interact with the algorithms used. This study begins to answer some of these questions through a detailed examination of maximum likelihood tree-search on a wide range of real genome-scale data sets. We examine all 10,395 trees for each of the 106 genes of an eight-taxa yeast phylogenomic data set, then apply different tree-search algorithms to investigate their performance. We extend our findings by examining two larger genome-scale data sets and a large disparate data set that has been previously used to benchmark the performance of tree-search programs. We identify several broad trends occurring during tree-search that provide an insight into the performance of heuristics and may, in the future, aid their development. These trends include a tendency for the true maximum likelihood (best) tree to also be the shortest tree in terms of branch lengths, a weak tendency for tree-search to recover the best tree, and a tendency for tree-search to encounter fewer local optima in genes that have a high information content. When examining current heuristics for tree-search, we find that nearest-neighbor-interchange performs poorly, and frequently finds trees that are significantly different from the best tree. In contrast, subtree-pruning-and-regrafting tends to perform well, nearly always finding trees that are not significantly different to the best tree. Finally, we demonstrate that the precise implementation of a tree-search strategy, including when and where parameters are optimized, can change the character of tree-search, and that good strategies for tree-search may combine existing tree-search programs.

Keywords

This publication has 31 references indexed in Scilit:

New Algorithms and Methods to Estimate Maximum-Likelihood Phylogenies: Assessing the Performance of PhyML 3.0
Systematic Biology, 2010
Phylogenomics and the reconstruction of the tree of life
Nature Reviews Genetics, 2005
Maximum Likelihood of Evolutionary Trees Is Hard
Published by Springer Nature ,2005
A Simple, Fast, and Accurate Algorithm to Estimate Large Phylogenies by Maximum Likelihood
Systematic Biology, 2003
PAL: an object-oriented programming library for molecular evolution and phylogenetics
Bioinformatics, 2001
On computing the nearest neighbor interchange distance
Published by American Mathematical Society (AMS) ,2000
Predicting the Evolution of Human Influenza A
Science, 1999
On the Linear-Cost Subtree-Transfer Distance between Phylogenetic Trees
Algorithmica, 1999
Evidence for a clade of nematodes, arthropods and other moulting animals
Nature, 1997
A new look at the statistical model identification
IEEE Transactions on Automatic Control, 1974