CHOP proteins into structural domain‐like fragments
- 1 April 2004
- journal article
- research article
- Published by Wiley in Proteins-Structure Function and Bioinformatics
- Vol. 55 (3), 678-688
- https://doi.org/10.1002/prot.20095
Abstract
We developed CHOP, a method that dissects proteins into domain-like fragments. The basic idea was to cut proteins beginning from very reliable experimental information (PDB), proceeding to expert annotations of domain-like regions (Pfam-A), and completing through cuts based on the termini of known proteins. In this way, CHOP dissected more than two thirds of all proteins from 62 proteomes. Analysis of our structural domain-like fragments revealed four surprising results. First, >70% of all dissected proteins contained more than one fragment. Second, domains spanned ~100 residues on average. This average was similar for eukaryotic and prokaryotic proteins, and it is also valid (although not previously described) for all proteins in the PDB. Third, single-domain proteins were significantly longer than most domains in multidomain proteins. Fourth, three fourths of all domains appeared shorter than 210 residues. We believe that our CHOP fragments constitute an important resource for functional and structural genomics. Nevertheless, our main motivation for developing CHOP was that the single-linkage clustering method failed to adequately group full-length proteins. In contrast, CLUP, the simple clustering scheme introduced here, largely succeeded in grouping the CHOP fragments from 62 proteomes such that all members of one cluster shared a basic structural core. CLUP found >63,000 multi-member and >118,000 single-member clusters. Although most fragments were restricted to a particular cluster, ~24% of the fragments were duplicated in at least two clusters. Our thresholds for grouping two fragments into the same cluster were rather conservative. Nevertheless, our results suggested that structural genomics initiatives have to target >30,000 fragments to at least cover the multi-member clusters in 62 proteomes. © 2004 Wiley-Liss, Inc.
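The priority-ordered cutting described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each evidence source (PDB, then Pfam-A, then termini-based cuts) has already been reduced to a list of residue ranges, that a range from a less reliable source is accepted only if it does not overlap anything already accepted, and that uncovered stretches are kept as fragments when they reach a cutoff length. The function name `chop_sketch`, the `min_len` parameter, and its default of 30 residues are hypothetical and are not taken from the abstract.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # 1-based, inclusive (start, end) residue range


def overlaps(a: Region, b: Region) -> bool:
    """Return True if two inclusive residue ranges share a position."""
    return a[0] <= b[1] and b[0] <= a[1]


def chop_sketch(length: int, sources: List[List[Region]],
                min_len: int = 30) -> List[Region]:
    """Cut a protein of `length` residues into fragments.

    `sources` is ordered from most to least reliable (assumed here to be
    PDB, Pfam-A, termini-based cuts). `min_len` is a hypothetical cutoff
    for keeping uncovered stretches as fragments.
    """
    accepted: List[Region] = []
    for regions in sources:
        for region in sorted(regions):
            # Evidence accepted earlier always wins over later evidence.
            if not any(overlaps(region, kept) for kept in accepted):
                accepted.append(region)
    accepted.sort()

    fragments = list(accepted)
    cursor = 1  # first residue not yet covered by an accepted region
    for start, end in accepted:
        if start - cursor >= min_len:
            fragments.append((cursor, start - 1))
        cursor = end + 1
    if length - cursor + 1 >= min_len:
        fragments.append((cursor, length))
    return sorted(fragments)


if __name__ == "__main__":
    pdb_regions = [(1, 120)]
    pfam_regions = [(100, 220), (230, 330)]  # the first overlaps the PDB region
    termini_regions: List[Region] = []
    print(chop_sketch(400, [pdb_regions, pfam_regions, termini_regions]))
```

In this toy input the Pfam-A range 100-220 is discarded because it overlaps the more reliable PDB range, while the uncovered stretches 121-229 and 331-400 are long enough to be kept as leftover fragments.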