CleanEST: a database of cleansed EST libraries
Open Access
- 2 October 2008
- journal article
- research article
- Published by Oxford University Press (OUP) in Nucleic Acids Research
- Vol. 37 (Database) , D686-D689
- https://doi.org/10.1093/nar/gkn648
Abstract
The EST division of GenBank, dbEST, is widely used in many applications such as gene discovery and verification of exon–intron structure. However, the use of EST sequences in the dbEST libraries is often hampered by inconsistent terminology used to describe the library sources and by the presence of contaminated sequences. Here, we describe CleanEST, a novel database server that classifies dbEST libraries and removes contaminants. We classified all dbEST libraries according to species and sequencing center. In addition, we further classified human EST libraries by anatomical and pathological systems according to eVOC ontologies. For each dbEST library, we provide two different sets of cleansed sequences: ‘pre-cleansed’ and ‘user-cleansed’. To generate pre-cleansed sequences, we cleansed sequences in dbEST by aligning EST sequences against well-known contamination sources: UniVec, Escherichia coli, mitochondria and chloroplast (for plants). To provide user-cleansed sequences, we built an automatic user-cleansing pipeline, in which sequences of a user-selected library are cleansed on the fly according to user-selected options. The server is available at http://cleanest.kobic.re.kr/ and the database is updated monthly.
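As a rough illustration of the cleansing step described in the abstract, the sketch below masks regions of each EST that align to a contaminant source and keeps the longest unmasked stretch. It is not the CleanEST implementation: the abstract does not name the alignment tool, thresholds, or trimming rules, so the `Hit` record, the `cleanse` function, and the `min_identity` and `min_length` parameters are all hypothetical, and the alignment hits are assumed to have been computed elsewhere.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One alignment of an EST against a contaminant source (hypothetical record)."""
    est_id: str      # identifier of the EST sequence
    source: str      # contaminant source, e.g. "UniVec" or "E. coli"
    start: int       # 0-based start of the aligned region on the EST
    end: int         # 0-based exclusive end of the aligned region
    identity: float  # percent identity of the alignment


def longest_clean_segment(seq, mask):
    """Return the longest stretch of seq whose positions are not masked."""
    best_start, best_len = 0, 0
    run_start = None
    # A trailing True acts as a sentinel that closes the final run.
    for i, bad in enumerate(mask + [True]):
        if not bad and run_start is None:
            run_start = i
        elif bad and run_start is not None:
            if i - run_start > best_len:
                best_start, best_len = run_start, i - run_start
            run_start = None
    return seq[best_start:best_start + best_len]


def cleanse(ests, hits, min_identity=95.0, min_length=100):
    """Mask contaminant hits and keep ESTs with a long enough clean segment.

    min_identity and min_length are assumed values, not CleanEST defaults.
    """
    cleansed = {}
    for est_id, seq in ests.items():
        mask = [False] * len(seq)
        for hit in hits:
            if hit.est_id == est_id and hit.identity >= min_identity:
                for i in range(max(hit.start, 0), min(hit.end, len(seq))):
                    mask[i] = True
        segment = longest_clean_segment(seq, mask)
        if len(segment) >= min_length:
            cleansed[est_id] = segment
    return cleansed
```

Calling `cleanse(ests, hits)` with a dictionary of sequences and a list of `Hit` records returns only those ESTs that still have at least `min_length` uncontaminated bases, which mirrors the idea of user-selected cleansing options in a simplified form.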