Assessing strategies for improved superfamily recognition
Open Access
- 1 July 2005
- journal article
- research article
- Published by Wiley in Protein Science
- Vol. 14 (7) , 1800-1810
- https://doi.org/10.1110/ps.041056105
Abstract
There are more than 200 completed genomes and over 1 million nonredundant sequences in public repositories. Although the structural data are more sparse (∼13,000 nonredundant structures solved to date), several powerful sequence‐based methodologies now allow these structures to be mapped onto related regions in a significant proportion of genome sequences. We review a number of publicly available strategies for providing structural annotations for genome sequences, and we describe the protocol adopted to provide CATH structural annotations for completed genomes. In particular, we assess the performance of several sequence‐based protocols employing hidden Markov model (HMM) technologies for superfamily recognition, including a new approach (SAMOSA [sequence augmented models of structure alignments]) that exploits multiple structural alignments from the CATH domain structure database when building the models. Using a data set of remote homologs detected by structure comparison and manually validated in CATH, a single‐seed HMM library was able to recognize 76% of the data set. Including the SAMOSA models in the HMM library showed little gain in homolog recognition, although a slight improvement in alignment quality was observed for very remote homologs. However, using an expanded 1D‐HMM library, CATH‐ISL, increased the coverage to 86%. The single‐seed HMM library has been used to annotate the protein sequences of 120 genomes from all three major kingdoms, allowing up to 70% of the genes or partial genes to be assigned to CATH superfamilies. It has also been used to recruit sequences from Swiss‐Prot and TrEMBL into CATH domain superfamilies, expanding the CATH database eightfold.
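The coverage figures quoted in the abstract (76% and 86%) are fractions of a benchmark set of validated remote homologs that a given HMM library recognizes. The following is a minimal sketch of how such a figure could be computed, not the authors' protocol. The hit-record layout, the function names `recognized` and `coverage`, the E-value cutoff of 1e-3, and the toy identifiers are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical hit record: (query_domain_id, matched_superfamily_id, e_value).
Hit = Tuple[str, str, float]


def recognized(hits: Iterable[Hit], truth: Dict[str, str],
               e_cutoff: float = 1e-3) -> Set[str]:
    """Return the benchmark domains matched to their true superfamily.

    A domain counts as recognized if at least one hit assigns it to the
    superfamily recorded in `truth` with an E-value at or below `e_cutoff`.
    The cutoff is an illustrative assumption, not a value from the paper.
    """
    found: Set[str] = set()
    for query, superfamily, e_value in hits:
        if truth.get(query) == superfamily and e_value <= e_cutoff:
            found.add(query)
    return found


def coverage(hits: Iterable[Hit], truth: Dict[str, str],
             e_cutoff: float = 1e-3) -> float:
    """Fraction of the benchmark set that the library recognizes."""
    if not truth:
        raise ValueError("benchmark set is empty")
    return len(recognized(hits, truth, e_cutoff)) / len(truth)


if __name__ == "__main__":
    # Toy benchmark: four domains with made-up superfamily labels.
    truth = {"d1": "SF_A", "d2": "SF_B", "d3": "SF_C", "d4": "SF_D"}
    # Hits from a hypothetical single-seed library; d3 scores too weakly.
    single_seed = [("d1", "SF_A", 1e-8), ("d2", "SF_B", 1e-4),
                   ("d3", "SF_C", 0.5)]
    # A hypothetical expanded library adds a confident hit for d3.
    expanded = single_seed + [("d3", "SF_C", 1e-5)]
    print(coverage(single_seed, truth))  # 0.5
    print(coverage(expanded, truth))     # 0.75
```

Pooling hits from several libraries before calling `coverage`, as in the `expanded` list, mirrors the idea of comparing a single-seed library against an expanded one on the same benchmark.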