Computational prediction of eukaryotic protein-coding genes

1 September 2002

journal article
review article
Published by Springer Nature in Nature Reviews Genetics

Vol. 3 (9) , 698-709
https://doi.org/10.1038/nrg890

Abstract

With the recent explosion in the availability of genome data, gene-finding programs have proliferated. However, the accuracy with which genes can be predicted is still far from satisfactory. This review provides background information and surveys the latest developments in gene-prediction programs. It also highlights the problems that face the gene-prediction field and discusses future research goals.
The main characteristic of a eukaryotic gene is its organization into exons and introns. The 'exon-definition' model explains how the splicing machinery recognizes exons in a sea of intronic DNA. It indicates that an internal exon is initially recognized by a chain of interacting splicing factors that span it. The binding of these factors to pre-mRNA is responsible for the non-random nucleotide patterns that form the molecular basis of all exon-recognition algorithms.
Correctly identifying the boundaries of a gene is essential when searching for several genes in a large genomic region. It is relatively easy to find internal exons, but many gene-prediction programs fail to identify gene boundaries. Determining the 3′ end of a gene is easier than determining its 5′ end, mainly because of the difficulty of identifying the promoter and transcriptional start-site sequences, and because the 5′ ends of cDNA sequences are often truncated.
As current gene-prediction programs are biased towards intron-containing genes, many intronless genes might have been missed by such programs. Many false-positive exon predictions have also been caused by pseudogenes. Developing better and more specialized algorithms to recognize them is becoming increasingly important.
Hidden Markov model (HMM)-based programs can be used to predict multiple genes, partial genes and genes on both strands, all at the same time. These features are essential when annotating genomes or large chunks of sequence data, such as large contigs, in an automated fashion.
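To make the HMM idea above concrete, the following is a minimal sketch of Viterbi decoding over a two-state (exon/intron) model. It is not the method of any program surveyed in the review: the state set, the `START`, `TRANS` and `EMIT` tables and the function name `viterbi` are all assumptions invented for illustration, and a real gene finder would use many more states (splice sites, both strands, partial genes) with trained parameters.

```python
import math

STATES = ("exon", "intron")

# Toy parameters, invented for illustration only (not from any real gene finder).
START = {"exon": 0.5, "intron": 0.5}
TRANS = {
    "exon": {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.1, "intron": 0.9},
}
# Assumed composition bias: "exons" are GC-rich, "introns" are AT-rich.
EMIT = {
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def viterbi(seq):
    """Return the most probable state path for seq under the toy model."""
    if not seq:
        return []
    # Work in log space so long sequences do not underflow.
    scores = [{s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}]
    back = []
    for base in seq[1:]:
        prev = scores[-1]
        col, ptr = {}, {}
        for s in STATES:
            # Best predecessor state for landing in state s at this position.
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            col[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][base])
            ptr[s] = best
        scores.append(col)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path
```

Calling `viterbi("GCGCGCGCGCATATATATAT")` labels each base with one of the two states; extending the state set is how such a model could, in principle, cover several genes or both strands in one pass.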
By comparing the genomes of several closely related species, conserved regulatory regions can be identified easily. For these reasons, making use of comparative genomic data is an important future challenge for the gene-prediction field. More functional genomics methods for finding genes are desperately needed to improve gene prediction. Only with sufficient mechanistic data can gene prediction be transformed from being statistical to being biological in nature. The field is working towards the ultimate dynamic model that can identify the consecutive exons of a gene, from its 5′ to its 3′ ends, as if they were being co-transcriptionally recognized and spliced.
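The comparative-genomics point above can be illustrated with a small sketch that scans two already-aligned sequences for windows of high identity. This is only an assumed, simplified stand-in for real conservation analysis: the function name `conserved_windows`, the window size and the identity threshold are hypothetical choices, and the input is assumed to be a gapped pairwise alignment of equal length produced by some other tool.

```python
def conserved_windows(aln_a, aln_b, window=10, min_identity=0.8):
    """Return (start, identity) for each window whose identity meets the threshold.

    aln_a and aln_b are two rows of a pairwise alignment (same length,
    '-' for gaps). Gap columns never count as matches.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    hits = []
    for start in range(len(aln_a) - window + 1):
        matches = sum(
            1
            for x, y in zip(aln_a[start:start + window], aln_b[start:start + window])
            if x == y and x != "-"
        )
        identity = matches / window
        if identity >= min_identity:
            hits.append((start, identity))
    return hits
```

Overlapping hits would normally be merged into regions before interpretation; that step is left out here to keep the sketch short.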
