Comprehensive genome analysis of 203 genomes provides structural genomics with new insights into protein family space
Open Access
- 6 February 2006
- journal article
- research article
- Published by Oxford University Press (OUP) in Nucleic Acids Research
- Vol. 34 (3) , 1066-1080
- https://doi.org/10.1093/nar/gkj494
Abstract
We present an analysis of 203 completed genomes in the Gene3D resource (including 17 eukaryotes), which demonstrates that the number of protein families is continually expanding over time and that singleton-sequences appear to be an intrinsic part of the genomes. A significant proportion of the proteomes can be assigned to fewer than 6000 well-characterized domain families with the remaining domain-like regions belonging to a much larger number of small uncharacterized families that are largely species specific. Our comprehensive domain annotation of 203 genomes enables us to provide more accurate estimates of the number of multi-domain proteins found in the three kingdoms of life than previous calculations. We find that 67% of eukaryotic sequences are multi-domain compared with 56% of sequences in prokaryotes. By measuring the domain coverage of genome sequences, we show that the structural genomics initiatives should aim to provide structures for less than a thousand structurally uncharacterized Pfam families to achieve reasonable structural annotation of the genomes. However, in large families, additional structures should be determined as these would reveal more about the evolution of the family and enable a greater understanding of how function evolves.
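The two quantities the abstract reports, the share of multi-domain sequences and the domain coverage of genome sequences, can be illustrated with a minimal sketch. This is not the Gene3D pipeline: the data layout (a hypothetical `Domain` tuple of start, end, family), the function names, and the choice of denominators are all assumptions made for illustration, and the paper's own definitions may differ.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: (start, end, family), 1-based inclusive residue coordinates.
Domain = Tuple[int, int, str]


def multi_domain_fraction(assignments: Dict[str, List[Domain]]) -> float:
    """Fraction of proteins with at least one domain that carry two or more.

    Assumption: unannotated proteins are excluded from the denominator.
    """
    annotated = [doms for doms in assignments.values() if doms]
    if not annotated:
        return 0.0
    multi = sum(1 for doms in annotated if len(doms) >= 2)
    return multi / len(annotated)


def residue_coverage(
    assignments: Dict[str, List[Domain]], lengths: Dict[str, int]
) -> float:
    """Fraction of all residues that fall inside at least one assigned domain."""
    covered = 0
    total = 0
    for protein_id, length in lengths.items():
        total += length
        positions = set()
        for start, end, _family in assignments.get(protein_id, []):
            # Clip to the sequence bounds; the set counts overlapping residues once.
            positions.update(range(max(1, start), min(end, length) + 1))
        covered += len(positions)
    return covered / total if total else 0.0


# Toy data, invented for illustration only.
toy_assignments = {
    "P1": [(1, 100, "famA"), (120, 200, "famB")],  # two domains
    "P2": [(10, 90, "famA")],                      # one domain
    "P3": [],                                      # no assignment
}
toy_lengths = {"P1": 200, "P2": 100, "P3": 100}

print(multi_domain_fraction(toy_assignments))
print(residue_coverage(toy_assignments, toy_lengths))
```

On the toy data, one of the two annotated proteins is multi-domain, and 262 of 400 residues are covered. Counting per residue rather than per sequence keeps a long, mostly unannotated protein from being scored as fully covered by a single short domain.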