Automated Gene Identification in Large-Scale Genomic Sequences
- 1 January 1997
- journal article
- research article
- Published by Mary Ann Liebert Inc in Journal of Computational Biology
- Vol. 4 (3) , 325-338
- https://doi.org/10.1089/cmb.1997.4.325
Abstract
Computational methods for gene identification in genomic sequences typically have two phases: coding region recognition and gene parsing. While there are a number of effective methods for recognizing coding regions (exons), parsing the recognized exons into proper gene structures remains, to a large extent, an unsolved problem. We have developed a computer program that automatically parses the recognized exons into the gene models most consistent with the available Expressed Sequence Tags (ESTs) and a set of empirically derived biological heuristics. The gene modeling algorithm used in this program provides a general framework for applying EST information, so the modeling accuracy improves as the amount of available EST information increases. Based on preliminary tests on a number of large DNA sequences, using the dbEST database, we have observed that the algorithm can (1) accurately model complicated multiple gene structures, including embedded genes, (2) identify falsely recognized exons and locate exons missed by the initial exon recognition phase, and (3) make more accurate exon boundary predictions, if the necessary EST information is available. We have extended this EST-based gene modeling algorithm to model genes on unfinished DNA contigs at the end of shotgun sequencing. Before the gene modeling phase, this extended version can automatically determine the orientations and the relative order of the DNA contigs (with gaps between them), using the available ESTs as reference models.
Key words: multiple gene structure prediction, expressed sequence tags, sequence comparison and analysis, pattern recognition, and dynamic programming.
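The abstract names dynamic programming and EST consistency as ingredients of the gene parsing phase but does not give the algorithm itself. The following is only a minimal sketch of the general idea, not the authors' method: it treats gene parsing as weighted interval scheduling over exon candidates and adds a fixed bonus to candidates that match an EST. The `Exon` record, the `est_support` flag, the `est_bonus` parameter, and the scoring rule are all hypothetical, and the sketch ignores reading frame, strand, splice-site compatibility, and multiple or embedded genes.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Tuple


class Exon(NamedTuple):
    """Hypothetical exon candidate; not a structure taken from the paper."""
    start: int          # inclusive start coordinate
    end: int            # exclusive end coordinate
    score: float        # assumed score from the exon recognition phase
    est_support: bool   # assumed flag: candidate is consistent with an EST


def best_exon_chain(exons: List[Exon],
                    est_bonus: float = 1.0) -> Tuple[float, List[Exon]]:
    """Return the best total score and the non-overlapping chain achieving it.

    Classic weighted interval scheduling: candidates are sorted by end
    coordinate, and best[i] is the best score using the first i candidates.
    """
    ordered = sorted(exons, key=lambda e: e.end)
    ends = [e.end for e in ordered]
    n = len(ordered)
    best = [0.0] * (n + 1)
    take = [False] * (n + 1)   # whether candidate i is used in best[i]
    prev = [0] * (n + 1)       # table index to continue from if i is used

    for i, exon in enumerate(ordered, start=1):
        # Count earlier candidates that end at or before this one starts.
        p = bisect_right(ends, exon.start, 0, i - 1)
        gain = exon.score + (est_bonus if exon.est_support else 0.0)
        prev[i] = p
        if best[p] + gain > best[i - 1]:
            best[i] = best[p] + gain
            take[i] = True
        else:
            best[i] = best[i - 1]

    # Trace back through the table to recover the chosen candidates.
    chain: List[Exon] = []
    i = n
    while i > 0:
        if take[i]:
            chain.append(ordered[i - 1])
            i = prev[i]
        else:
            i -= 1
    chain.reverse()
    return best[n], chain


if __name__ == "__main__":
    candidates = [
        Exon(0, 100, 2.0, False),
        Exon(50, 150, 2.5, False),
        Exon(120, 200, 1.0, True),
        Exon(160, 260, 1.0, False),
    ]
    total, chain = best_exon_chain(candidates)
    print(total, [(e.start, e.end) for e in chain])
```

In this toy input, the EST bonus changes which chain is selected, which loosely mirrors the abstract's claim that modeling accuracy improves as more EST information becomes available. A realistic parser would also need the biological constraints listed above.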