CATHEDRAL: A Fast and Effective Algorithm to Predict Folds and Domain Boundaries from Multidomain Protein Structures

Open Access

30 November 2007

journal article
research article
Published by Public Library of Science (PLoS) in PLoS Computational Biology

Vol. 3 (11) , e232-2347
https://doi.org/10.1371/journal.pcbi.0030232

Abstract

We present CATHEDRAL, an iterative protocol for determining the location of previously observed protein folds in novel multidomain protein structures. CATHEDRAL builds on the features of a fast secondary-structure–based method (using graph theory) to locate known folds within a multidomain context and a residue-based, double-dynamic programming algorithm, which is used to align members of the target fold groups against the query protein structure to identify the closest relative and assign domain boundaries. To increase the fidelity of the assignments, a support vector machine is used to provide an optimal scoring scheme. Once a domain is verified, it is excised, and the search protocol is repeated in an iterative fashion until all recognisable domains have been identified. We have performed an initial benchmark of CATHEDRAL against other publicly available structure comparison methods using a consensus dataset of domains derived from the CATH and SCOP domain classifications. CATHEDRAL shows superior performance in fold recognition and alignment accuracy when compared with many equivalent methods. If a novel multidomain structure contains a known fold, CATHEDRAL will locate it in 90% of cases, with <1% false positives. For nearly 80% of assigned domains in a manually validated test set, the boundaries were correctly delineated within a tolerance of ten residues. For the remaining cases, previously classified domains were very remotely related to the query chain so that embellishments to the core of the fold caused significant differences in domain sizes and manual refinement of the boundaries was necessary. To put this performance in context, a well-established sequence method based on hidden Markov models was only able to detect 65% of domains, with 33% of the subsequent boundaries assigned within ten residues. Since, on average, 50% of newly determined protein structures contain more than one domain unit, and typically 90% or more of these domains are already classified in CATH, CATHEDRAL will considerably facilitate the automation of protein structure classification. Proteins comprise individual folding units known as domains, with a significant proportion containing two or more (multidomain structures). Each domain is thought to represent a unit of evolution and adopts a specific fold. Detecting domains is often the first step in classifying proteins into evolutionary families for studying the relationship between sequence, structure, and function. Automatically identifying domains from structural data is problematic due to the fact that domains vary substantially in their compactness and geometric separation from one another in the whole protein. We present a novel method, CATHEDRAL, which iteratively identifies each domain by comparing a query structure against a library of manually verified domains in the CATH domain database through computational structure comparison. We find that CATHEDRAL is able to outperform the majority of popular structure comparison methods for finding structural relatives. Furthermore, it is able to accurately identify domain boundaries and outperform other methods of structure-based domain prediction for the majority of proteins. CATHEDRAL is available as a Webserver to provide domain annotations for the community and hence aid in structural and functional characterisation of newly solved protein structures.

Keywords

This publication has 45 references indexed in Scilit:

Structural Diversity of Domain Superfamilies in the CATH Database
Journal of Molecular Biology, 2006
Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions
Acta Crystallographica Section D-Biological Crystallography, 2004
Toward Consistent Assignment of Structural Domains in Proteins
Journal of Molecular Biology, 2004
Domain combinations in archaeal, eubacterial and eukaryotic proteomes
Journal of Molecular Biology, 2001
Characterization of novel proteins based on known protein structures 1 1Edited by R. Huber
Journal of Molecular Biology, 2000
The Protein Data Bank
Nucleic Acids Research, 2000
Domain assignment for protein structures using a consensus approach: Characterization and analysis
Protein Science, 1998
CATH – a hierarchic classification of protein domain structures
Published by Elsevier ,1997
Threading a database of protein cores
Proteins-Structure Function and Bioinformatics, 1995
Definition of general topological equivalence in protein structures
Journal of Molecular Biology, 1990