Quality measures for protein alignment benchmarks

Open Access

4 January 2010

journal article
research article
Published by Oxford University Press (OUP) in Nucleic Acids Research

Vol. 38 (7) , 2145-2153
https://doi.org/10.1093/nar/gkp1196

Abstract

Multiple protein sequence alignment methods are central to many applications in molecular biology. These methods are typically assessed on benchmark datasets, including BALIBASE, OXBENCH, PREFAB and SABMARK, which biologists rely on to make informed choices between programs. In this article, annotations of domain homology and secondary structure are used to define new measures of alignment quality and to make the first systematic, independent evaluation of these benchmarks. These measures indicate sensitivity and specificity while avoiding the ambiguous residue correspondences and arbitrary distance cutoffs inherent to structural superpositions. Alignments by selected methods that indicate high-confidence columns (ALIGN-M, DIALIGN-T, FSA and MUSCLE) are also assessed. Fold space coverage and effective benchmark database sizes are estimated by reference to domain annotations, and significant redundancy is found in all benchmarks except SABMARK. Questionable alignments are found in all benchmarks, especially in BALIBASE, where 87% of sequences have unknown structure, 20% of columns contain different folds according to SUPERFAMILY and 30% of ‘core block’ columns have conflicting secondary structure according to DSSP. A careful analysis of current protein multiple alignment benchmarks calls into question their ability to determine reliable algorithm rankings.
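To make the idea of a column-level measure concrete, the following is a minimal sketch of how the fraction of alignment columns with conflicting secondary structure might be computed. It is not the article's implementation: the function names, the input layout (gapped rows plus one label per ungapped residue) and the definition of a conflict (a column containing both a DSSP helix code `H` and a strand code `E`) are assumptions made for illustration only.

```python
from typing import List, Optional

GAP = "-"


def column_labels(aligned: List[str], labels: List[Optional[str]]) -> List[List[str]]:
    """Collect, for each alignment column, the labels of the residues in it.

    aligned: gapped sequences of equal length.
    labels:  one label string per sequence, one character per ungapped
             residue, or None when the sequence has no annotation
             (for example, unknown structure).
    """
    n_cols = len(aligned[0])
    if any(len(row) != n_cols for row in aligned):
        raise ValueError("all aligned rows must have the same length")
    columns: List[List[str]] = [[] for _ in range(n_cols)]
    for row, lab in zip(aligned, labels):
        pos = 0  # index into the ungapped sequence
        for col, ch in enumerate(row):
            if ch == GAP:
                continue
            if lab is not None:
                columns[col].append(lab[pos])
            pos += 1
    return columns


def conflicting_fraction(aligned: List[str], labels: List[Optional[str]]) -> float:
    """Fraction of assessable columns that mix helix (H) and strand (E).

    A column is treated as assessable only if at least two annotated
    residues fall in it (an assumption of this sketch).
    """
    assessable = 0
    conflicting = 0
    for col in column_labels(aligned, labels):
        if len(col) < 2:
            continue
        assessable += 1
        if "H" in col and "E" in col:
            conflicting += 1
    return conflicting / assessable if assessable else 0.0


# Toy input: the third sequence has no structure annotation.
aligned = ["AC-DE", "ACGDE", "A-GDE"]
labels = ["HHEE", "HHHHE", None]
print(conflicting_fraction(aligned, labels))
```

In this toy input, four columns hold two annotated residues each and one of them pairs a strand residue with a helix residue. Sequences without annotation contribute nothing to the count, which is why a benchmark with many sequences of unknown structure leaves fewer columns that can be checked this way.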
