Quality measures for protein alignment benchmarks
Open Access
- 4 January 2010
- journal article
- research article
- Published by Oxford University Press (OUP) in Nucleic Acids Research
- Vol. 38 (7), 2145-2153
- https://doi.org/10.1093/nar/gkp1196
Abstract
Multiple protein sequence alignment methods are central to many applications in molecular biology. These methods are typically assessed on benchmark datasets including BALIBASE, OXBENCH, PREFAB and SABMARK, which are important to biologists in making informed choices between programs. In this article, annotations of domain homology and secondary structure are used to define new measures of alignment quality and are used to make the first systematic, independent evaluation of these benchmarks. These measures indicate sensitivity and specificity while avoiding the ambiguous residue correspondences and arbitrary distance cutoffs inherent to structural superpositions. Alignments by selected methods that indicate high-confidence columns (ALIGN-M, DIALIGN-T, FSA and MUSCLE) are also assessed. Fold space coverage and effective benchmark database sizes are estimated by reference to domain annotations, and significant redundancy is found in all benchmarks except SABMARK. Questionable alignments are found in all benchmarks, especially in BALIBASE where 87% of sequences have unknown structure, 20% of columns contain different folds according to SUPERFAMILY and 30% of ‘core block’ columns have conflicting secondary structure according to DSSP. A careful analysis of current protein multiple alignment benchmarks calls into question their ability to determine reliable algorithm rankings.
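To make the idea of "columns with conflicting secondary structure" concrete, the following is a minimal sketch in Python. It is not the article's actual definition: the function names, the grouping of DSSP codes into helix and strand classes, and the rule that a column conflicts when it contains both a helix and a strand residue are all assumptions chosen for illustration.

```python
from typing import List

# Assumed grouping of DSSP one-letter codes (illustrative, not from the article).
HELIX_CODES = set("GHI")   # 3-10 helix, alpha helix, pi helix
STRAND_CODES = set("EB")   # extended strand, isolated bridge


def ss_class(code: str) -> str:
    """Map a DSSP code to 'H' (helix), 'E' (strand) or '-' (other/unknown)."""
    if code in HELIX_CODES:
        return "H"
    if code in STRAND_CODES:
        return "E"
    return "-"


def conflicting_column_fraction(aligned: List[str], ss: List[str]) -> float:
    """Fraction of assessed columns that mix helix and strand residues.

    aligned: alignment rows of equal length, with '-' marking gaps.
    ss:      one ungapped DSSP string per row, one code per residue.
    A column is assessed only if at least two rows have a residue in it.
    """
    if len(aligned) != len(ss):
        raise ValueError("need one secondary-structure string per row")
    if len({len(row) for row in aligned}) > 1:
        raise ValueError("alignment rows must have equal length")

    ncols = len(aligned[0]) if aligned else 0
    next_residue = [0] * len(aligned)  # index of the next residue in each row
    assessed = 0
    conflicts = 0

    for col in range(ncols):
        classes = set()
        residues = 0
        for i, row in enumerate(aligned):
            if row[col] == "-":
                continue
            cls = ss_class(ss[i][next_residue[i]])
            next_residue[i] += 1
            residues += 1
            if cls != "-":
                classes.add(cls)
        if residues >= 2:
            assessed += 1
            if len(classes) > 1:  # both helix and strand present
                conflicts += 1

    return conflicts / assessed if assessed else 0.0
```

Under these assumptions, a benchmark-wide figure would be obtained by pooling assessed and conflicting columns over all reference alignments; coil and unannotated residues are deliberately treated as non-conflicting here.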