Multi-domain protein families and domain pairs: comparison with known structures and a random model of domain recombination

1 January 2003

journal article
research article
Published by Springer Nature in Journal of Structural and Functional Genomics

Vol. 4 (2/3) , 67-78
https://doi.org/10.1023/a:1026113408773

Abstract

There is a limited repertoire of domain families in nature that are duplicated and combined in different ways to form the set of proteins in a genome. Most proteins in both prokaryote and eukaryote genomes consist of two or more domains, and we show that the family size distribution of multi-domain protein families follows a power law like that of individual families. Most domain pairs occur in four to six different domain architectures: in isolation and in combinations with different partners. We showed previously that within the set of all pairwise domain combinations, most small and medium-sized families are observed in combination with one or two other families, while a few large families are very versatile and combine with many different partners. Though this may appear to be a stochastic pattern, in which large families have more combination partners by virtue of their size, we establish here that all the domain families with more than three members in genomes are duplicated more frequently than would be expected by chance considering their number of neighbouring domains. This duplication of domain pairs is statistically significant for between one and three quarters of all families with seven or more members. For the majority of pairwise domain combinations, there is no known three-dimensional structure of the two domains together, and we term these novel combinations. Novel domain combinations are interesting and important targets for structural elucidation, as the geometry and interaction between the domains will help understand the function and evolution of multi-domain proteins. Of particular interest are those combinations that occur in the largest number of multi-domain proteins, and several of these frequent novel combinations contain DNA-binding domains.

Abbreviations: SCOP, Structural Classification of Proteins database; PDB, Protein Data Bank; HMM, hidden Markov model.
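The quantities discussed in the abstract (how often a domain pair occurs, and how many different partners a family combines with) can be illustrated with a small sketch. This is not the authors' method or code. It assumes, for illustration only, that each protein is given as a tuple of domain family names in N-to-C order and that a "domain pair" means two sequentially adjacent domains. The function name `pair_statistics` and the toy family labels are hypothetical.

```python
from collections import Counter, defaultdict


def pair_statistics(architectures):
    """Count adjacent domain pairs and combination partners.

    architectures: iterable of tuples of family names in N-to-C order
    (an assumed input format, not one taken from the paper).
    Returns (pair_counts, partner_counts).
    """
    pair_counts = Counter()       # (left, right) -> number of occurrences
    partners = defaultdict(set)   # family -> set of adjacent families
    for arch in architectures:
        # Walk over sequentially adjacent domains within one protein.
        for left, right in zip(arch, arch[1:]):
            pair_counts[(left, right)] += 1
            partners[left].add(right)
            partners[right].add(left)
    partner_counts = {fam: len(p) for fam, p in partners.items()}
    return pair_counts, partner_counts


# Toy data with hypothetical family labels; "E" is a single-domain protein.
toy = [("A", "B"), ("A", "B", "C"), ("D", "A"), ("A", "B"), ("E",)]
pair_counts, partner_counts = pair_statistics(toy)
print(pair_counts.most_common(1))
print(partner_counts)
```

In this toy set the pair A-B is duplicated across three proteins while family A has two distinct partners. A comparison against a random recombination model, as described in the abstract, would need a null distribution for these counts, which this sketch does not attempt.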
