Selection of representative protein data sets
Open Access
- 1 March 1992
- journal article
- research article
- Published by Wiley in Protein Science
- Vol. 1 (3) , 409-417
- https://doi.org/10.1002/pro.5560010313
Abstract
The Protein Data Bank currently contains about 600 data sets of three‐dimensional protein coordinates determined by X‐ray crystallography or NMR. There is considerable redundancy in the data base, as many protein pairs are identical or very similar in sequence. However, statistical analyses of protein sequence‐structure relations require nonredundant data. We have developed two algorithms to extract from the data base representative sets of protein chains with maximum coverage and minimum redundancy. The first algorithm focuses on optimizing a particular property of the selected proteins and works by successive selection of proteins from an ordered list and exclusion of all neighbors of each selected protein. The other algorithm aims at maximizing the size of the selected set and works by successive thinning out of clusters of similar proteins. Both algorithms are generally applicable to other data bases in which criteria of similarity can be defined and relate to problems in graph theory. The largest nonredundant set extracted from the current release of the Protein Data Bank has 155 protein chains. In this set, no two proteins have sequence similarity higher than a certain cutoff (30% identical residues for aligned subsequences longer than 80 residues), yet all structurally unique protein families are represented. Periodically updated lists of representative data sets are available by electronic mail from the file server “netserv@embl‐heidelberg.de.” The selection may be useful in statistical approaches to protein folding as well as in the analysis and documentation of the known spectrum of three‐dimensional protein structures.
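The two selection strategies described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the similarity relation has already been computed and is given as a symmetric neighbor map in which every chain appears as a key. The function names `select_from_ordered_list` and `thin_out_clusters`, the chain labels, and the rule of always removing the chain with the most remaining neighbors are assumptions made for illustration; the abstract says only that clusters are thinned out successively.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical representation: chain id -> set of chains considered similar to it.
# The map is assumed symmetric, with every chain present as a key.
NeighborMap = Dict[str, Set[str]]


def select_from_ordered_list(ordered: Iterable[str], neighbors: NeighborMap) -> List[str]:
    """Take chains in priority order; each selected chain excludes all its neighbors."""
    excluded: Set[str] = set()
    selected: List[str] = []
    for chain in ordered:
        if chain in excluded:
            continue
        selected.append(chain)
        excluded.add(chain)
        excluded.update(neighbors.get(chain, ()))
    return selected


def thin_out_clusters(neighbors: NeighborMap) -> List[str]:
    """Repeatedly remove the chain with the most remaining neighbors
    until no two remaining chains are similar (assumed thinning rule)."""
    live = {chain: set(adj) for chain, adj in neighbors.items()}  # working copy
    while live:
        worst = max(live, key=lambda chain: len(live[chain]))
        if not live[worst]:
            break  # no similar pairs remain
        for other in live.pop(worst):
            live[other].discard(worst)
    return sorted(live)


# Toy example: chain "H" is similar to "A", "B" and "C", which are not similar to each other.
toy = {"H": {"A", "B", "C"}, "A": {"H"}, "B": {"H"}, "C": {"H"}}

print(select_from_ordered_list(["H", "A", "B", "C"], toy))  # ['H']
print(thin_out_clusters(toy))                               # ['A', 'B', 'C']
```

In the toy example the ordered-list pass keeps the single highest-priority chain, while the thinning pass keeps three mutually dissimilar chains. This mirrors the trade-off the abstract describes between optimizing a property of the selected proteins and maximizing the size of the selected set.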