Making sense of EST sequences by CLOBBing them

Open Access

25 October 2002

journal article
research article
Published by Springer Nature in BMC Bioinformatics

Vol. 3 (1), 31
https://doi.org/10.1186/1471-2105-3-31

Abstract

Expressed sequence tags (ESTs) are single-pass reads from randomly selected cDNA clones. They provide a highly cost-effective way to access and identify expressed genes. However, they are prone to sequencing errors and typically define incomplete transcripts. To increase the amount of information obtainable from ESTs and to reduce the impact of sequencing errors, it is necessary to cluster ESTs into groups that share significant sequence similarity. As part of our ongoing EST programs investigating 'orphan' genomes, we have developed a clustering algorithm, CLOBB (Cluster on the basis of BLAST similarity), to identify and cluster ESTs. CLOBB may be used incrementally, preserving original cluster designations. It tracks cluster-specific events such as merging, identifies 'superclusters' of related clusters, and avoids the expansion of chimeric clusters. Written in the Perl scripting language, CLOBB is highly portable, relying only on a local installation of NCBI's freely available BLAST executable, and can be usefully applied to >95% of current EST datasets. Analysis of the Danio rerio EST dataset demonstrates that CLOBB compares favourably with two less portable systems, UniGene and the TIGR Gene Indices. CLOBB provides a highly portable EST clustering solution and can be downloaded freely from: http://www.nematodes.org/CLOBB
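
The incremental behaviour described in the abstract (new sequences join existing clusters, cluster designations are preserved, and merges are recorded) can be illustrated with a small sketch. This is not CLOBB's implementation, which is written in Perl and driven by BLAST; it is a toy Python model that assumes the similarity hits for each new sequence have already been computed. The class name `IncrementalClusterer`, its fields, and the rule of keeping the lowest cluster identifier on a merge are all assumptions made for illustration. Chimera detection and superclusters are deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class IncrementalClusterer:
    """Toy incremental clustering driven by precomputed similarity hits.

    Hypothetical illustration only; not the CLOBB algorithm itself.
    """
    cluster_of: dict = field(default_factory=dict)  # sequence id -> cluster id
    members: dict = field(default_factory=dict)     # cluster id -> set of sequence ids
    merge_log: list = field(default_factory=list)   # (kept_id, absorbed_id) events
    next_id: int = 1

    def add(self, seq_id, hits):
        """Place seq_id given `hits`, the ids of sequences it is similar to."""
        # Only hits that are already clustered can decide the placement.
        matched = sorted({self.cluster_of[h] for h in hits if h in self.cluster_of})
        if not matched:
            # No similarity to any clustered sequence: open a new cluster.
            cid = self.next_id
            self.next_id += 1
            self.members[cid] = set()
        else:
            # Assumption: keep the lowest (oldest) id so that existing
            # designations survive, and absorb the other matched clusters.
            cid = matched[0]
            for other in matched[1:]:
                for s in self.members.pop(other):
                    self.cluster_of[s] = cid
                    self.members[cid].add(s)
                self.merge_log.append((cid, other))
        self.cluster_of[seq_id] = cid
        self.members[cid].add(seq_id)
        return cid


clusterer = IncrementalClusterer()
clusterer.add("est1", [])                # starts cluster 1
clusterer.add("est2", [])                # starts cluster 2
clusterer.add("est3", ["est1", "est2"])  # bridges both, so cluster 2 merges into 1
```

In this sketch a sequence that matches two clusters always triggers a merge. The abstract states that CLOBB avoids expanding chimeric clusters, so a real implementation would need an additional check before merging.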
