Exploration of Uncharted Regions of the Protein Universe
Open Access
- 29 September 2009
- journal article
- research article
- Published by Public Library of Science (PLoS) in PLoS Biology
- Vol. 7 (9), e1000205
- https://doi.org/10.1371/journal.pbio.1000205
Abstract
The genome projects have unearthed an enormous diversity of genes of unknown function that are still awaiting biological and biochemical characterization. These genes, like most others, can be grouped into families based on sequence similarity. The PFAM database currently contains over 2,200 such families, referred to as domains of unknown function (DUF). In a coordinated effort, the four large-scale centers of the NIH Protein Structure Initiative have determined the first three-dimensional structures for more than 250 of these DUF families. Analysis of the first 248 reveals that about two thirds of the DUF families likely represent very divergent branches of already known and well-characterized families, which allows hypotheses to be formulated about their biological function. The remainder can be formally categorized as new folds, although about one third of these show significant substructure similarity to previously characterized folds. These results imply that, despite the enormous increase in the number and the diversity of new genes being uncovered, the fold space of the proteins they encode is gradually becoming saturated. The previously unexplored sectors of the protein universe appear to be primarily shaped by extreme diversification of known protein families, which then enables organisms to evolve new functions and adapt to particular niches and habitats. Nonetheless, these DUF families still constitute the richest source for discovery of the remaining protein folds and topologies.
More than 40% of known proteins lack any annotation within public databases and are usually referred to as hypothetical proteins, despite most of them being real and many being evolutionarily conserved and thus expected to play important biological roles. Determination of the three-dimensional structures of representatives of more than 240 families of protein domains of unknown function by the Protein Structure Initiative has provided a unique sample of regions of the protein universe that, until this systematic effort, were completely uncharacterized. Analysis of these structures reveals that most of the 240 families can be considered as remote homologs of already known protein families. Such distant evolutionary links can sometimes be predicted by current state-of-the-art sequence comparison tools, but structural analysis has led to the first hypotheses about biological functions for many of these uncharacterized proteins, and serves as a starting point for experimental studies. The rapid pace of discovery of such relationships suggests that the protein universe is made up of a relatively small and stable number of ‘extended neighborhoods’ that bring together distantly related protein families. Thus, the vast uncharacterized part of the protein universe, called by some “the dark matter of protein space”, may consist mainly of highly divergent homologs. Continued structural characterization of these previously under-investigated regions of the protein universe should further help unravel the patterns and rules that led to such divergence in the evolution of protein structure and function.