Gene Structure Prediction Using an Orthologous Gene of Known Exon-Intron Structure
- 1 January 2004
- journal article
- Published by Springer Nature in Applied Bioinformatics
- Vol. 3 (2) , 81-90
- https://doi.org/10.2165/00822942-200403020-00002
Abstract
Given the availability of complete genome sequences from related organisms, sequence conservation can provide important clues for predicting gene structure. In particular, one should be able to leverage information about known genes in one species to help determine the structures of related genes in another. Such an approach is appealing in that high-quality gene prediction can be achieved for newly sequenced species, such as mouse and puffer fish, using the extensive knowledge that has been accumulated about human genes. This article reports a novel approach to predicting the exon-intron structures of mouse genes by incorporating constraints from orthologous human genes using techniques that have previously been exploited in speech and natural language processing applications. The approach uses a context-free grammar to parse a training corpus of annotated human genes. A statistical training procedure produces a weighted recursive transition network (RTN) intended to capture the general features of a mammalian gene. This RTN is expanded into a finite state transducer (FST) and composed with an FST capturing the specific features of the human orthologue. This model includes a trigram language model on the amino acid sequence as well as exon length constraints. A final stage uses the free software package ClustalW to align the top n candidates in the search space. For a set of 98 orthologous human-mouse pairs, we achieved 96% sensitivity and 97% specificity at the exon level on the mouse genes, given only knowledge gleaned from the annotated human genome.
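The abstract does not define how the exon-level figures were computed. As a rough illustration only, the sketch below assumes the convention common in gene-prediction evaluation: a predicted exon counts as correct only when both of its boundaries match an annotated exon, sensitivity is the fraction of annotated exons recovered, and "specificity" is the fraction of predicted exons that are correct. The function name, the coordinate tuples, and the example values are hypothetical and do not come from the article.

```python
from typing import Iterable, Set, Tuple

# Hypothetical representation: an exon is a (start, end) coordinate pair.
Exon = Tuple[int, int]


def exon_level_accuracy(
    predicted: Iterable[Exon], annotated: Iterable[Exon]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) at the exon level.

    Assumed convention (not stated in the abstract): an exon is correct
    only if both its start and end match an annotated exon exactly.
    """
    pred: Set[Exon] = set(predicted)
    true: Set[Exon] = set(annotated)
    correct = len(pred & true)  # exons with both boundaries right
    sensitivity = correct / len(true) if true else 0.0
    specificity = correct / len(pred) if pred else 0.0
    return sensitivity, specificity


# Made-up coordinates for illustration only.
annotated_exons = [(100, 200), (300, 400), (500, 650), (800, 900)]
predicted_exons = [(100, 200), (300, 400), (500, 640), (800, 900), (950, 1000)]

sn, sp = exon_level_accuracy(predicted_exons, annotated_exons)
print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```

Here three of the four annotated exons are recovered exactly and three of the five predictions are correct, so a single misplaced boundary lowers both measures, which is why exon-level scores are stricter than nucleotide-level ones.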