Gene recognition via spliced sequence alignment.

20 August 1996

journal article
research article
Published in Proceedings of the National Academy of Sciences

Vol. 93 (17), 9061-9066
https://doi.org/10.1073/pnas.93.17.9061

Abstract

Gene recognition is one of the most important problems in computational molecular biology. Previous attempts to solve this problem were based on statistics, and applications of combinatorial methods for gene recognition were almost unexplored. Recent advances in large-scale cDNA sequencing open a way toward a new approach to gene recognition that uses previously sequenced genes as a clue for recognition of newly sequenced genes. This paper describes a spliced alignment algorithm and software tool that explores all possible exon assemblies in polynomial time and finds the multiexon structure with the best fit to a related protein. Unlike other existing methods, the algorithm successfully recognizes genes even in the case of short exons or exons with unusual codon usage; we also report correct assemblies for genes with more than 10 exons. On a test sample of human genes with known mammalian relatives, the average correlation between the predicted and actual proteins was 99%. The algorithm correctly reconstructed 87% of genes and the rare discrepancies between the predicted and real exon-intron structures were caused either by short (less than 5 amino acids) initial/terminal exons or by alternative splicing. Moreover, the algorithm predicts human genes reasonably well when the homologous protein is nonvertebrate or even prokaryotic. The surprisingly good performance of the method was confirmed by extensive simulations: in particular, with target proteins at 160 accepted point mutations (PAM) (25% similarity), the correlation between the predicted and actual genes was still as high as 95%.
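
The idea of exploring all exon assemblies in polynomial time can be illustrated with a small dynamic program. The sketch below is not the paper's implementation. It assumes the candidate exons ("blocks") are already given as half-open intervals on the genomic sequence, it compares nucleotide strings directly rather than aligning against a protein, and it uses a plain match/mismatch/gap score instead of a substitution matrix. The function name `spliced_alignment` and its parameters are hypothetical.

```python
from typing import List, Tuple

Block = Tuple[int, int]  # half-open interval [start, end) on the genomic sequence


def spliced_alignment(genomic: str, blocks: List[Block], target: str,
                      match: int = 1, mismatch: int = -1,
                      gap: int = -1) -> Tuple[int, Tuple[Block, ...]]:
    """Best-scoring chain of non-overlapping blocks aligned globally to target.

    Simplified illustration only: each DP cell carries (score, chain) so the
    chosen chain can be read off directly, at the cost of extra memory.
    """
    m = len(target)
    # Row for an empty chain: every target prefix is aligned against gaps.
    start_row = [(j * gap, ()) for j in range(m + 1)]
    finished = []                # (block, last DP row) for processed blocks
    best = start_row[m]          # fallback: no block used at all

    # Sorting by end guarantees every possible predecessor is processed first.
    for block in sorted(blocks, key=lambda b: (b[1], b[0])):
        start, end = block
        # Entry row: cell-wise best over the empty chain and every block
        # that ends at or before this block starts.
        row = list(start_row)
        for (_, prev_end), prev_row in finished:
            if prev_end <= start:
                row = [max(a, b) for a, b in zip(row, prev_row)]
        row = [(score, chain + (block,)) for score, chain in row]

        # Ordinary global-alignment recurrence, restricted to this block.
        for i in range(start, end):
            new = [(row[0][0] + gap, row[0][1])]
            for j in range(1, m + 1):
                sub = match if genomic[i] == target[j - 1] else mismatch
                new.append(max(
                    (row[j - 1][0] + sub, row[j - 1][1]),  # align two letters
                    (row[j][0] + gap, row[j][1]),          # gap in target
                    (new[j - 1][0] + gap, new[j - 1][1]),  # gap in genomic
                ))
            row = new

        finished.append((block, row))
        best = max(best, row[m])
    return best


# Toy data: two true exons separated by an intron, plus two decoy blocks.
genomic = "ATGGC" + "GTAAAG" + "TTCA" + "GG"
blocks = [(0, 5), (11, 15), (3, 9), (6, 10)]
print(spliced_alignment(genomic, blocks, "ATGGCTTCA"))
# prints (9, ((0, 5), (11, 15)))
```

With b blocks of total length L and a target of length m, this sketch does roughly O(L·m + b²·m) work, so the number of chains it implicitly considers can be exponential while the running time stays polynomial.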
