MetaFam: a unified classification of protein families. I. Overview and statistics

Open Access

1 March 2001

journal article
research article
Published by Oxford University Press (OUP) in Bioinformatics

Vol. 17 (3) , 249-261
https://doi.org/10.1093/bioinformatics/17.3.249

Abstract

Motivation: Protein sequence classification is becoming an increasingly important means of organizing the voluminous data produced by large-scale genome sequencing projects. At present, there are several independent classification methods. To aid the general classification effort, we have created a unified protein family resource, MetaFam. MetaFam is a protein family classification built upon 10 publicly-accessible protein family databases (Blocks+, DOMO, Pfam, PIR-ALN, PRINTS, PROSITE, ProDom, PROTOMAP, SBASE, and SYSTERS). MetaFam’s family ‘supersets’, as we call them, are created automatically using set-theory to compare families among the databases. Families of one database are matched to those in another when the intersection of their members exceeds all other possible family pairings between the two databases. Pairwise family matches are drawn together transitively to create a new list of protein family supersets.

Results: MetaFam family supersets have several useful features: (1) each superset contains more members than the families from which it is composed, because each of the component family databases only works with a subset of our full non-redundant set of proteins; (2) conflicting assignments can be pinpointed quickly, since our analysis identifies individual members that are in conflict with the majority consensus; (3) family descriptions that are absent from automated databases can frequently be assigned; (4) statistics have been computed comparing domain boundaries, family size distributions, and overall quality of MetaFam supersets; (5) the supersets have been loaded into a relational database to allow for complex queries and visualization of the connections among families in a superset and the consensus of individual domain members; and (6) the quality of individual supersets has been assessed using numerous quantitative measures such as family consistency, connectedness, and size. We anticipate this new resource will be particularly useful to genomic database curators.

Availability: Free access to the MetaFam web server is provided to all users at http://metafam.ahc.umn.edu/.

Contact: metafam@ahc.umn.edu

Supplementary information: Detailed distribution plots on MetaFam 2.0 supersets and its constituent family databases (e.g. superset/family sizes, domain boundary comparisons) are shown at http://metafam.ahc.umn.edu/mf2.0/stats.html. Statistics on the current release of MetaFam can be found at http://metafam.ahc.umn.edu/current_release/stats.html.
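
The matching rule outlined in the abstract (pair families by their best member overlap, then join the pairs transitively) can be illustrated with a small sketch. This is not the MetaFam implementation: the abstract does not specify tie handling or whether the best match must hold in both directions, so the code assumes a strict, reciprocal best overlap. The function names (`best_matches`, `build_supersets`, `superset_members`), the union-find grouping, and the toy databases `db1` to `db3` are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def best_matches(db_a, db_b):
    """Pair families whose member overlap strictly exceeds every rival pairing.

    Assumption: a pairing (fa, fb) is accepted only if its overlap is larger
    than that of any other pairing involving fa or fb (reciprocal best match).
    """
    overlap = {
        (fa, fb): len(members_a & members_b)
        for fa, members_a in db_a.items()
        for fb, members_b in db_b.items()
    }
    matches = []
    for (fa, fb), size in overlap.items():
        if size == 0:
            continue
        # Rival pairings share exactly one of the two families with (fa, fb).
        rivals = [s for (xa, xb), s in overlap.items()
                  if (xa == fa) != (xb == fb)]
        if all(size > s for s in rivals):
            matches.append((fa, fb))
    return matches


def build_supersets(databases):
    """Join pairwise matches transitively using a union-find structure."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    # Register every family so unmatched ones form singleton supersets.
    for db_name, families in databases.items():
        for fam in families:
            find((db_name, fam))

    for (name_a, db_a), (name_b, db_b) in combinations(databases.items(), 2):
        for fa, fb in best_matches(db_a, db_b):
            parent[find((name_a, fa))] = find((name_b, fb))

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


def superset_members(superset, databases):
    """Union of the protein members of every family in a superset."""
    members = set()
    for db_name, fam in superset:
        members |= databases[db_name][fam]
    return members


# Hypothetical toy data: database -> family -> set of protein identifiers.
databases = {
    "db1": {"F1": {"p1", "p2", "p3"}, "F2": {"p7", "p8"}},
    "db2": {"G1": {"p2", "p3", "p4"}, "G2": {"p8", "p9"}},
    "db3": {"H1": {"p4", "p5"}},
}

supersets = build_supersets(databases)
for superset in supersets:
    print(sorted(superset), sorted(superset_members(superset, databases)))
```

In this toy data, F1 and H1 share no proteins, yet both match G1, so the transitive step places all three in one superset covering p1 to p5. No single family contains all five proteins, which mirrors feature (1) of the Results: a superset can hold more members than any of its component families.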
