Sequence-based prediction of protein domains
Open Access
- 7 July 2004
- journal article
- research article
- Published by Oxford University Press (OUP) in Nucleic Acids Research
- Vol. 32 (12), 3522-3530
- https://doi.org/10.1093/nar/gkh684
Abstract
Guessing the boundaries of structural domains has been an important and challenging problem in experimental and computational structural biology. Predictions were based on intuition, biochemical properties, statistics, sequence homology and other aspects of predicted protein structure. Here, we introduced CHOPnet, a de novo method that predicts structural domains in the absence of homology to known domains. Our method was based on neural networks and relied exclusively on information available for all proteins. Evaluating sustained performance through rigorous cross-validation on proteins of known structure, we correctly predicted the number of domains in 69% of all proteins. For 50% of the two-domain proteins the centre of the predicted boundary was closer than 20 residues to the boundary assigned from three-dimensional (3D) structures; this was about eight percentage points better than predictions by ‘equal split’. Our results appeared to compare favourably with those from previously published methods. CHOPnet may be useful to restrict the experimental testing of different fragments for structure determination in the context of structural genomics.
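The boundary criterion quoted in the abstract (predicted centre closer than 20 residues to the structure-derived boundary) and the ‘equal split’ baseline can be illustrated with a minimal sketch. This is not the authors' evaluation code. It assumes that ‘equal split’ for a two-domain protein means placing the boundary at the sequence midpoint. The function names and the toy proteins are hypothetical and chosen only for illustration.

```python
from typing import Sequence, Tuple


def equal_split_boundary(length: int) -> int:
    """Baseline boundary for a two-domain protein.

    Assumption: 'equal split' places the single boundary at the midpoint.
    """
    return length // 2


def boundary_hit(predicted: int, observed: int, tolerance: int = 20) -> bool:
    """True when the predicted boundary centre is closer than
    `tolerance` residues to the observed boundary (strict inequality)."""
    return abs(predicted - observed) < tolerance


def hit_rate(pairs: Sequence[Tuple[int, int]], tolerance: int = 20) -> float:
    """Fraction of (predicted, observed) boundary pairs counted as hits."""
    if not pairs:
        raise ValueError("need at least one protein")
    hits = sum(boundary_hit(p, o, tolerance) for p, o in pairs)
    return hits / len(pairs)


# Made-up two-domain proteins: (sequence length, observed boundary position).
toy_proteins = [(200, 95), (300, 110), (250, 130), (400, 150)]

# Score the equal-split baseline on the toy set.
baseline_pairs = [(equal_split_boundary(n), b) for n, b in toy_proteins]
baseline_rate = hit_rate(baseline_pairs)
```

A predictor's boundaries could be scored with the same `hit_rate` call, and the difference between the two rates would correspond to the kind of percentage-point comparison reported in the abstract.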