Computational prediction of eukaryotic protein-coding genes
- 1 September 2002
- journal article
- review article
- Published by Springer Nature in Nature Reviews Genetics
- Vol. 3 (9), 698-709
- https://doi.org/10.1038/nrg890
Abstract
With the recent explosion in the availability of genome data, gene-finding programs have proliferated. However, the accuracy with which genes can be predicted is still far from satisfactory. This review provides background information and surveys the latest developments in gene-prediction programs. It also highlights the problems that face the gene-prediction field and discusses future research goals.
The main characteristic of a eukaryotic gene is its organization into exons and introns. The 'exon-definition' model explains how the splicing machinery recognizes exons in a sea of intronic DNA. It indicates that an internal exon is initially recognized by a chain of interacting splicing factors that span it. The binding of these factors to pre-mRNA is responsible for the non-random nucleotide patterns that form the molecular basis of all exon-recognition algorithms.
Correctly identifying the boundaries of a gene is essential when searching for several genes in a large genomic region. It is relatively easy to find internal exons, but many gene-prediction programs fail to identify gene boundaries. Determining the 3′ end of a gene is easier than determining its 5′ end, mainly because of the difficulty of identifying the promoter and transcriptional start-site sequences, and because the 5′ ends of cDNA sequences are often truncated.
As current gene-prediction programs are biased towards intron-containing genes, many intronless genes might have been missed by such programs. Many false-positive exon predictions have also been caused by pseudogenes. Developing better and more specialized algorithms to recognize them is becoming increasingly important.
Hidden Markov model (HMM)-based programs can be used to predict multiple genes, partial genes and genes on both strands, all at the same time. These features are essential when annotating genomes or large chunks of sequence data, such as large contigs, in an automated fashion.
By comparing the genomes of several closely related species, conserved regulatory regions can be identified easily. For these reasons, making use of comparative genomic data is an important future challenge for the gene-prediction field.
More functional genomics methods for finding genes are desperately needed to improve gene prediction. Only with sufficient mechanistic data can gene prediction be transformed from being statistical to being biological in nature. The field is working towards the ultimate dynamic model that can identify the consecutive exons of a gene, from its 5′ to its 3′ ends, as if they were being co-transcriptionally recognized and spliced.
This publication has 75 references indexed in Scilit:
- Correct Identification of Genes from Serial Analysis of Gene Expression Tag Sequences. Genomics, 2002
- Listening to silence and understanding nonsense: exonic mutations that affect splicing. Nature Reviews Genetics, 2002
- Carbohydrate microarrays for the recognition of cross-reactive molecular markers of microbes and host cells. Nature Biotechnology, 2002
- Insulator: from chromatin domain boundary to gene regulation. Human Genetics, 2001
- Computational Inference of Homologous Gene Structures in the Human Genome. Genome Research, 2001
- Evaluation of Gene-Finding Programs on Mammalian Sequences. Genome Research, 2001
- Highly specific localization of promoter regions in large genomic sequences by PromoterInspector: a novel context analysis approach. Journal of Molecular Biology, 2000
- Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology, 1997
- Identification of protein coding regions by database similarity search. Nature Genetics, 1993
- A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 1989