UniRef: comprehensive and non-redundant UniProt reference clusters

Open Access

22 March 2007

journal article
Published by Oxford University Press (OUP) in Bioinformatics

Vol. 23 (10), 1282-1288
https://doi.org/10.1093/bioinformatics/btm098

Abstract

Motivation: Redundant protein sequences in biological databases hinder sequence similarity searches and make interpretation of search results difficult. Clustering of protein sequence space based on sequence similarity helps organize all sequences into manageable datasets and reduces sampling bias and overrepresentation of sequences.

Results: The UniRef (UniProt Reference Clusters) provide clustered sets of sequences from the UniProt Knowledgebase (UniProtKB) and selected UniProt Archive records to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences. Currently covering >4 million source sequences, the UniRef100 database combines identical sequences and subfragments from any source organism into a single UniRef entry. UniRef90 and UniRef50 are built by clustering UniRef100 sequences at the 90 or 50% sequence identity levels. UniRef100, UniRef90 and UniRef50 yield a database size reduction of ∼10, 40 and 70%, respectively, from the source sequence set. The reduced redundancy increases the speed of similarity searches and improves detection of distant relationships. UniRef entries contain summary cluster and membership information, including the sequence of a representative protein, member count and common taxonomy of the cluster, the accession numbers of all the merged entries and links to rich functional annotation in UniProtKB to facilitate biological discovery. UniRef has already been applied to broad research areas ranging from genome annotation to proteomics data analysis.

Availability: UniRef is updated biweekly and is available for online search and retrieval at http://www.uniprot.org, as well as for download at ftp://ftp.uniprot.org/pub/databases/uniprot/uniref

Contact: bes23@georgetown.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
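The UniRef100 step described in the abstract (combining identical sequences and subfragments into a single entry) can be illustrated with a toy sketch. This is not the UniRef production pipeline: the function name `merge_identical_and_subfragments`, the accession-like IDs and the sequences are all hypothetical, and the quadratic substring scan is chosen for readability only and would not scale to millions of sequences.

```python
from typing import Dict, List


def merge_identical_and_subfragments(seqs: Dict[str, str]) -> Dict[str, List[str]]:
    """Toy illustration: fold identical sequences and subfragments into the
    cluster of a longer (or equal-length) sequence that contains them.

    Returns a mapping from representative ID to the list of member IDs.
    """
    # Process longest sequences first so that any sequence able to contain
    # a fragment has already become a representative when the fragment is seen.
    ordered = sorted(seqs.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    clusters: Dict[str, List[str]] = {}
    for seq_id, seq in ordered:
        for rep_id in clusters:
            if seq in seqs[rep_id]:  # identical, or an exact subfragment
                clusters[rep_id].append(seq_id)
                break
        else:
            # No existing representative contains this sequence.
            clusters[seq_id] = [seq_id]
    return clusters


# Hypothetical input: IDs and sequences are invented for illustration.
toy = {
    "P1": "MKTAYIAKQRQISFVKSHFSRQ",
    "P2": "MKTAYIAKQRQISFVKSHFSRQ",  # identical to P1
    "P3": "AYIAKQRQ",                # exact subfragment of P1
    "P4": "MSDNELLKKWW",             # unrelated sequence
}
clusters = merge_identical_and_subfragments(toy)

# Fractional size reduction relative to the source set, analogous in spirit
# to the percentages quoted in the abstract.
reduction = 1 - len(clusters) / len(toy)
print(clusters, reduction)
```

In this toy input, four source sequences collapse into two entries. Clustering at 90 or 50% identity, as for UniRef90 and UniRef50, would need an alignment-based identity measure in place of the exact substring test used here.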
