Toward the development of a gene index to the human genome: an assessment of the nature of high-throughput EST sequence data.
Open Access
- 1 September 1996
- journal article
- case report
- Published by Cold Spring Harbor Laboratory in Genome Research
- Vol. 6 (9) , 829-845
- https://doi.org/10.1101/gr.6.9.829
Abstract
A rigorous analysis of the Merck-sponsored EST data with respect to known gene sequences increases the utility of the data set and helps refine methods for building a gene index. A highly curated human transcript data base was used as a reference data set of known genes. A detailed analysis of EST sequences derived from known genes was performed to assess the accuracy of EST sequence annotation. The EST data was screened to remove low-quality and low-complexity sequences. A set of high-quality ESTs similar to the transcript data base was identified using BLAST; this subset of ESTs was compared with the set of known genes using the Smith-Waterman algorithm. Error rates of several types were assessed based on a flexible match criterion defining sequence identity. The rate of lane-tracking errors is very low, approximately 0.5%. Insert size data is accurate within approximately 20%. Reversed clone and internal priming error rates are approximately 5% and 2.5%, respectively, contributing to the incorrect identification of reads as 3' ends of genes. Follow-up investigation reveals that a significant number of clones, miscategorized as reversed, represent overlapping genes on the opposite strand of entries in the transcript data base. Relevance of these results to the creation of a high-quality index to the human genome capable of supporting diverse genomic investigations is discussed.
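The abstract names Smith-Waterman as the algorithm used to compare ESTs with known genes, and it reports a class of reversed-clone errors. The following is a minimal illustrative sketch of that idea, not the authors' implementation: it computes a Smith-Waterman local alignment score and checks which strand of an EST aligns better to a reference transcript. The scoring values (match 2, mismatch -1, linear gap -2) and the function names `smith_waterman_score`, `reverse_complement`, and `best_strand` are assumptions chosen for illustration; the paper's actual parameters and match criterion are not given in the abstract.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (linear gap penalty).

    The scoring values are illustrative assumptions, not the paper's.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def best_strand(est: str, transcript: str) -> tuple[str, int]:
    """Report which orientation of the EST scores higher (hypothetical helper)."""
    forward = smith_waterman_score(est, transcript)
    reverse = smith_waterman_score(reverse_complement(est), transcript)
    return ("forward", forward) if forward >= reverse else ("reverse", reverse)
```

In this sketch, a read that scores higher in the reverse orientation against a transcript would be a candidate reversed clone. As the abstract notes, some such reads may instead come from overlapping genes on the opposite strand, so a score comparison alone would not settle the question.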