Identification and distribution of protein families in 120 completed genomes using Gene3D

14 March 2005

journal article
research article
Published by Wiley in Proteins-Structure Function and Bioinformatics

Vol. 59 (3), 603–615
https://doi.org/10.1002/prot.20409

Abstract

Using a new protocol, PFscape, we undertake a systematic identification of protein families and domain architectures in 120 complete genomes. PFscape clusters sequences into protein families using a Markov clustering algorithm (Enright et al., Nucleic Acids Res 2002;30:1575–1584) followed by complete linkage clustering according to sequence identity. Within each protein family, domains are recognized using a library of hidden Markov models comprising CATH structural and Pfam functional domains. Domain architectures are then determined using DomainFinder (Pearl et al., Protein Sci 2002;11:233–244), and the protein family and domain architecture data are amalgamated in the Gene3D database (Buchan et al., Genome Res 2002;12:503–514). Using Gene3D, we have investigated protein sequence space, the extent of structural annotation, and the distribution of different domain architectures in completed genomes from all kingdoms of life. As in earlier studies by other researchers, the distribution of domain families shows power‐law behavior, such that the largest 2,000 domain families can be mapped to ∼70% of nonsingleton genome sequences; the remaining sequences are assigned to much smaller families. While ∼50% of domain annotations within a genome are assigned to 219 universal domain families, a much smaller proportion (< 10%) of protein sequences are assigned to universal protein families. This supports the mosaic theory of evolution, whereby domain duplication followed by domain shuffling gives rise to novel domain architectures that can expand the protein functional repertoire of an organism. Functional data (e.g., COG/KEGG/GO) integrated within Gene3D result in a comprehensive resource that is currently being used in structural genomics initiatives and can be accessed via http://www.biochem.ucl.ac.uk/bsm/cath/Gene3D/. Proteins 2005.
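
The second clustering stage named in the abstract, complete linkage clustering by sequence identity, can be illustrated with a minimal sketch. This is not the PFscape implementation: the function name `complete_linkage`, the toy identity values, and the 0.5 threshold are all assumptions chosen for illustration. Under complete linkage, two clusters merge only if every cross-cluster pair meets the identity threshold, so a single high-identity pair cannot chain unrelated groups together.

```python
from itertools import combinations


def complete_linkage(ids, identity, threshold):
    """Greedy complete-linkage clustering (illustrative sketch only).

    ids       -- list of sequence identifiers
    identity  -- dict mapping frozenset({a, b}) to fractional identity;
                 missing pairs are treated as 0.0 (an assumption)
    threshold -- minimum identity required between ALL cross-cluster pairs
    """
    clusters = [{i} for i in ids]

    def link(a, b):
        # Complete linkage: cluster similarity is the WORST pairwise identity.
        return min(identity.get(frozenset((x, y)), 0.0) for x in a for y in b)

    while len(clusters) > 1:
        # Find the pair of clusters with the highest complete-linkage score.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: link(clusters[p[0]], clusters[p[1]]),
        )
        if link(clusters[i], clusters[j]) < threshold:
            break  # no remaining pair satisfies the threshold
        clusters[i] |= clusters[j]
        del clusters[j]  # safe: combinations guarantees i < j
    return clusters


# Hypothetical pairwise identities for four sequences within one family.
identity = {
    frozenset(("A", "B")): 0.90,
    frozenset(("B", "C")): 0.80,
    frozenset(("A", "C")): 0.35,
    frozenset(("C", "D")): 0.60,
}

result = complete_linkage(["A", "B", "C", "D"], identity, threshold=0.5)
print(sorted(sorted(c) for c in result))
```

In this toy input, B and C share 80% identity, yet C does not join the A/B cluster because the A–C pair falls below the threshold; single linkage would have merged them.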

This publication has 46 references indexed in Scilit:

CHOP proteins into structural domain‐like fragments
Proteins-Structure Function and Bioinformatics, 2004
The Pfam protein families database
Nucleic Acids Research, 2004
Target Selection and Determination of Function in Structural Genomics
IUBMB Life, 2003
An efficient algorithm for large-scale detection of protein families
Nucleic Acids Research, 2002
Domain combinations in archaeal, eubacterial and eukaryotic proteomes
Journal of Molecular Biology, 2001
KEGG: Kyoto Encyclopedia of Genes and Genomes
Nucleic Acids Research, 2000
The Protein Data Bank
Nucleic Acids Research, 2000
GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences
Journal of Molecular Biology, 1999
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
Nucleic Acids Research, 1997
CATH – a hierarchic classification of protein domain structures
Structure, 1997