Assessing strategies for improved superfamily recognition

Open Access

1 July 2005

journal article
research article
Published by Wiley in Protein Science

Vol. 14 (7) , 1800-1810
https://doi.org/10.1110/ps.041056105

Abstract

There are more than 200 completed genomes and over 1 million nonredundant sequences in public repositories. Although structural data are sparser (∼13,000 nonredundant structures solved to date), several powerful sequence‐based methodologies now allow these structures to be mapped onto related regions in a significant proportion of genome sequences. We review a number of publicly available strategies for providing structural annotations for genome sequences, and we describe the protocol adopted to provide CATH structural annotations for completed genomes. In particular, we assess the performance of several sequence‐based protocols employing hidden Markov model (HMM) technologies for superfamily recognition, including a new approach (SAMOSA [sequence augmented models of structure alignments]) that exploits multiple structural alignments from the CATH domain structure database when building the models. Using a data set of remote homologs detected by structure comparison and manually validated in CATH, a single‐seed HMM library was able to recognize 76% of the data set. Including the SAMOSA models in the HMM library showed little gain in homolog recognition, although a slight improvement in alignment quality was observed for very remote homologs. However, an expanded 1D‐HMM library, CATH‐ISL, increased the coverage to 86%. The single‐seed HMM library has been used to annotate the protein sequences of 120 genomes from all three major kingdoms, allowing up to 70% of the genes or partial genes to be assigned to CATH superfamilies. It has also been used to recruit sequences from Swiss‐Prot and TrEMBL into CATH domain superfamilies, expanding the CATH database eightfold.
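The coverage figures quoted in the abstract (76% and 86%) are fractions of a benchmark set of remote homologs that an HMM library recognizes. A minimal sketch of how such a fraction could be computed follows. It is an illustration only: the `Hit` record, the `coverage` function, the E-value cutoff of 1e-3, and the toy identifiers are all assumptions made here, not the benchmarking criteria or data used in the article.

```python
from typing import Dict, Iterable, NamedTuple


class Hit(NamedTuple):
    """One library match for a benchmark sequence (hypothetical record layout)."""
    query_id: str       # benchmark domain that was scanned against the library
    superfamily: str    # superfamily of the model that matched
    e_value: float      # significance reported for the match


def coverage(true_superfamily: Dict[str, str],
             hits: Iterable[Hit],
             e_value_cutoff: float = 1e-3) -> float:
    """Fraction of benchmark domains matched to their correct superfamily.

    The cutoff is an assumed value chosen for illustration only.
    """
    recognized = set()
    for hit in hits:
        expected = true_superfamily.get(hit.query_id)
        if (expected is not None
                and hit.superfamily == expected
                and hit.e_value <= e_value_cutoff):
            recognized.add(hit.query_id)
    if not true_superfamily:
        return 0.0
    return len(recognized) / len(true_superfamily)


# Toy benchmark: four domains with made-up superfamily labels.
benchmark = {"d1": "sf_A", "d2": "sf_B", "d3": "sf_C", "d4": "sf_D"}

# Toy hits: d1 and d4 are recognized; d2 misses the cutoff; d3 hits the wrong superfamily.
toy_hits = [
    Hit("d1", "sf_A", 1e-10),
    Hit("d2", "sf_B", 0.5),
    Hit("d3", "sf_A", 1e-8),
    Hit("d4", "sf_D", 1e-4),
]

toy_coverage = coverage(benchmark, toy_hits)
print(f"coverage = {toy_coverage:.0%}")
```

Counting each benchmark domain at most once keeps the measure a true fraction even when several models from the same superfamily match the same domain, which is the situation an expanded library would be expected to produce.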
