A Comprehensive Benchmark Study of Multiple Sequence Alignment Methods: Current Challenges and Future Perspectives

Open Access

31 March 2011

journal article
research article
Published by Public Library of Science (PLoS) in PLOS ONE

Vol. 6 (3), e18093
https://doi.org/10.1371/journal.pone.0018093

Abstract

Multiple comparison or alignment of protein sequences has become a fundamental tool in many different domains in modern molecular biology, from evolutionary studies to prediction of 2D/3D structure, molecular function and inter-molecular interactions etc. By placing the sequence in the framework of the overall family, multiple alignments can be used to identify conserved features and to highlight differences or specificities. In this paper, we describe a comprehensive evaluation of many of the most popular methods for multiple sequence alignment (MSA), based on a new benchmark test set. The benchmark is designed to represent typical problems encountered when aligning the large protein sequence sets that result from today's high throughput biotechnologies. We show that alignment methods have significantly progressed and can now identify most of the shared sequence features that determine the broad molecular function(s) of a protein family, even for divergent sequences. However, we have identified a number of important challenges. First, the locally conserved regions, that reflect functional specificities or that modulate a protein's function in a given cellular context, are less well aligned. Second, motifs in natively disordered regions are often misaligned. Third, the badly predicted or fragmentary protein sequences, which make up a large proportion of today's databases, lead to a significant number of alignment errors. Based on this study, we demonstrate that the existing MSA methods can be exploited in combination to improve alignment accuracy, although novel approaches will still be needed to fully explore the most difficult regions. We then propose knowledge-enabled, dynamic solutions that will hopefully pave the way to enhanced alignment construction and exploitation in future evolutionary systems biology studies.
