CATHEDRAL: A Fast and Effective Algorithm to Predict Folds and Domain Boundaries from Multidomain Protein Structures
Open Access
- 30 November 2007
- journal article
- research article
- Published by Public Library of Science (PLoS) in PLoS Computational Biology
- Vol. 3 (11), e232-2347
- https://doi.org/10.1371/journal.pcbi.0030232
Abstract
We present CATHEDRAL, an iterative protocol for determining the location of previously observed protein folds in novel multidomain protein structures. CATHEDRAL builds on the features of a fast secondary-structure–based method (using graph theory) to locate known folds within a multidomain context and a residue-based, double-dynamic programming algorithm, which is used to align members of the target fold groups against the query protein structure to identify the closest relative and assign domain boundaries. To increase the fidelity of the assignments, a support vector machine is used to provide an optimal scoring scheme. Once a domain is verified, it is excised, and the search protocol is repeated in an iterative fashion until all recognisable domains have been identified. We have performed an initial benchmark of CATHEDRAL against other publicly available structure comparison methods using a consensus dataset of domains derived from the CATH and SCOP domain classifications. CATHEDRAL shows superior performance in fold recognition and alignment accuracy when compared with many equivalent methods. If a novel multidomain structure contains a known fold, CATHEDRAL will locate it in 90% of cases, with <1% false positives. For nearly 80% of assigned domains in a manually validated test set, the boundaries were correctly delineated within a tolerance of ten residues. For the remaining cases, previously classified domains were very remotely related to the query chain so that embellishments to the core of the fold caused significant differences in domain sizes and manual refinement of the boundaries was necessary. To put this performance in context, a well-established sequence method based on hidden Markov models was only able to detect 65% of domains, with 33% of the subsequent boundaries assigned within ten residues. 
Since, on average, 50% of newly determined protein structures contain more than one domain unit, and typically 90% or more of these domains are already classified in CATH, CATHEDRAL will considerably facilitate the automation of protein structure classification.
Proteins comprise individual folding units known as domains, with a significant proportion containing two or more (multidomain structures). Each domain is thought to represent a unit of evolution and adopts a specific fold. Detecting domains is often the first step in classifying proteins into evolutionary families for studying the relationship between sequence, structure, and function. Automatically identifying domains from structural data is problematic due to the fact that domains vary substantially in their compactness and geometric separation from one another in the whole protein. We present a novel method, CATHEDRAL, which iteratively identifies each domain by comparing a query structure against a library of manually verified domains in the CATH domain database through computational structure comparison. We find that CATHEDRAL is able to outperform the majority of popular structure comparison methods for finding structural relatives. Furthermore, it is able to accurately identify domain boundaries and outperform other methods of structure-based domain prediction for the majority of proteins. CATHEDRAL is available as a Webserver to provide domain annotations for the community and hence aid in structural and functional characterisation of newly solved protein structures.
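The iterative part of the protocol (find the best-scoring known fold, verify it, excise it, and search again) can be illustrated with a minimal sketch. This is not the CATHEDRAL implementation: the names `DomainHit`, `assign_domains`, `toy_search`, and `score_cutoff` are hypothetical, and the toy scoring function (fraction of a template's residues still present in the query) merely stands in for the graph-based fold search, double-dynamic programming alignment, and support vector machine scoring described above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Iterable, List, Optional, Set


@dataclass(frozen=True)
class DomainHit:
    """A candidate domain assignment (hypothetical structure for illustration)."""
    fold_id: str               # identifier of the matched library fold
    residues: FrozenSet[int]   # query residue numbers assigned to the domain
    score: float               # match score; higher is better


def assign_domains(
    query_residues: Iterable[int],
    find_best_hit: Callable[[Set[int]], Optional[DomainHit]],
    score_cutoff: float,
) -> List[DomainHit]:
    """Repeatedly take the best hit, excise its residues, and search again."""
    remaining = set(query_residues)
    assigned: List[DomainHit] = []
    while remaining:
        hit = find_best_hit(remaining)
        # Stop when nothing recognisable is left in the unassigned residues.
        if hit is None or hit.score < score_cutoff:
            break
        domain = hit.residues & remaining
        if not domain:
            break
        assigned.append(DomainHit(hit.fold_id, frozenset(domain), hit.score))
        remaining -= domain  # excise the verified domain before the next pass
    return assigned


def toy_search(library: Dict[str, FrozenSet[int]]) -> Callable[[Set[int]], Optional[DomainHit]]:
    """Build a stand-in search: score = fraction of template residues still unassigned."""
    def find_best_hit(remaining: Set[int]) -> Optional[DomainHit]:
        best: Optional[DomainHit] = None
        for fold_id, residues in library.items():
            overlap = residues & remaining
            score = len(overlap) / len(residues)
            if best is None or score > best.score:
                best = DomainHit(fold_id, frozenset(overlap), score)
        return best
    return find_best_hit
```

In this sketch the loop ends as soon as the best remaining hit falls below `score_cutoff`, which mirrors the idea of repeating the search "until all recognisable domains have been identified"; the cutoff value and the scoring scheme themselves are assumptions, not values taken from the paper.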