A novel algorithm for computational identification of contaminated EST libraries
Open Access
- 1 February 2003
- journal article
- Published by Oxford University Press (OUP) in Nucleic Acids Research
- Vol. 31 (3), 1067-1074
- https://doi.org/10.1093/nar/gkg170
Abstract
A key goal of the Human Genome Project was to understand the complete set of human proteins, the proteome. Since the genome sequence by itself is not sufficient for predicting new genes and alternative splicing events that lead to new proteins, expressed sequence tags (ESTs) are used as the primary tool for these purposes. The high prevalence of artifacts in dbEST, however, often leads to invalid predictions. Here we describe a novel method for recognizing genomic DNA contamination and other artifacts that cannot be identified using current EST cleaning techniques. Our method uses the alignment of the entire set of ESTs to the human genome to identify highly contaminated EST libraries. We discovered 53 highly contaminated libraries and a subset of 24 766 ESTs from these libraries that probably represent contamination with genomic DNA, pre‐mRNA, and ESTs that span non‐canonical introns. Although this is only a small fraction of the entire EST dataset, each contaminating sequence could create a spurious transcript prediction. Indeed, in the clustering and assembly tool that we used, these sequences would have caused incorrect inference of 9575 new splice variants and 6370 new genes. Conclusions based on EST analysis, including prediction of alternative splicing, should be re‐evaluated in light of these results. Our method, along with the identified set of contaminated sequences, will be essential for applications that depend on large EST datasets.
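The abstract does not give the criteria the authors used, so the following is only a minimal sketch of how a library-level screen based on EST-to-genome alignments might look. It assumes that each alignment record carries a library identifier and a count of aligned blocks, and that a library dominated by unspliced (single-block) alignments is worth a closer look. The record layout, the function name `flag_suspect_libraries`, and both thresholds are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class EstAlignment(NamedTuple):
    """One EST aligned to the genome (hypothetical record layout)."""
    est_id: str
    library_id: str
    n_blocks: int  # aligned blocks; 1 means the EST spans no intron


def flag_suspect_libraries(
    alignments: Iterable[EstAlignment],
    max_unspliced_fraction: float = 0.8,  # assumed cutoff, not from the paper
    min_ests: int = 3,                    # assumed minimum library size
) -> Dict[str, float]:
    """Return {library_id: unspliced fraction} for libraries above the cutoff."""
    totals: Dict[str, int] = defaultdict(int)
    unspliced: Dict[str, int] = defaultdict(int)

    # Count all ESTs and unspliced ESTs per library.
    for aln in alignments:
        totals[aln.library_id] += 1
        if aln.n_blocks == 1:
            unspliced[aln.library_id] += 1

    flagged: Dict[str, float] = {}
    for library_id, total in totals.items():
        if total < min_ests:
            continue  # too few ESTs to estimate a fraction
        fraction = unspliced[library_id] / total
        if fraction > max_unspliced_fraction:
            flagged[library_id] = fraction
    return flagged
```

An unspliced alignment is not in itself evidence of contamination, since many genuine ESTs fall within a single exon. A realistic screen would probably compare each library against the background rate across all libraries rather than use a fixed cutoff.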