A set-theoretic approach to database searching and clustering.
Open Access
- 1 June 1998
- journal article
- research article
- Published by Oxford University Press (OUP) in Bioinformatics
- Vol. 14 (5), 430-438
- https://doi.org/10.1093/bioinformatics/14.5.430
Abstract
MOTIVATION: In this paper, we introduce an iterative method of database searching and apply it to design a database clustering algorithm applicable to an entire protein database. The clustering procedure relies on the quality of the database searching routine and further improves its results based on a set-theoretic analysis of a highly redundant yet efficient to generate cluster system.

RESULTS: Overall, we achieve unambiguous assignment of 80% of SWISS-PROT sequences to non-overlapping sequence clusters in an entirely automatic fashion. Our results are compared to an expert-generated clustering for validation. The database searching method is fast and the clustering technique does not require time-consuming all-against-all comparison. This allows for fast clustering of large amounts of sequences.

AVAILABILITY: The resulting clustering for the PIR1 (Release 51) and SWISS-PROT (Release 34) databases is available over the Internet from http://www.dkfz-heidelberg.de/tbi/services/modest/browsesysters.pl.

CONTACT: a.krause@dkfz-heidelberg.de; m.vingron@dkfz-heidelberg.de