Multiple sequence alignment: a major challenge to large-scale phylogenetics
Open Access
- 7 January 2011
- journal article
- Published by Public Library of Science (PLoS) in PLoS Currents
- Vol. 2, RRN1198
- https://doi.org/10.1371/currents.rrn1198
Abstract
Over the last decade, dramatic advances have been made in developing methods for large-scale phylogeny estimation, so that it is now feasible for investigators with moderate computational resources to obtain reasonable solutions to maximum likelihood and maximum parsimony, even for datasets with a few thousand sequences. There has also been progress on developing methods for multiple sequence alignment, so that greater alignment accuracy (and subsequent improvement in phylogenetic accuracy) is now possible through automated methods. However, these methods have not been tested under conditions that reflect properties of datasets confronted by large-scale phylogenetic estimation projects. In this paper we report on a study that compares several alignment methods on a benchmark collection of nucleotide sequence datasets of up to 78,132 sequences. We show that as the number of sequences increases, the number of alignment methods that can analyze the datasets decreases. Furthermore, the most accurate alignment methods are unable to analyze the very largest datasets we studied, so that only moderately accurate alignment methods can be used on the largest datasets. As a result, alignments computed for large datasets have relatively large error rates, and maximum likelihood phylogenies computed on these alignments also have high error rates. Therefore, the estimation of highly accurate multiple sequence alignments is a major challenge for Tree of Life projects, and more generally for large-scale systematics studies.