Incorporation of splice site probability models for non-canonical introns improves gene structure prediction in plants
Open Access
- 1 November 2005
- journal article
- research article
- Published by Oxford University Press (OUP) in Bioinformatics
- Vol. 21 (Suppl_3), iii20-iii30
- https://doi.org/10.1093/bioinformatics/bti1205
Abstract
Motivation: The vast majority of introns in protein-coding genes of higher eukaryotes have a GT dinucleotide at their 5′-terminus and an AG dinucleotide at their 3′-terminus. About 1–2% of introns are non-canonical, with the most abundant subtype of non-canonical introns being characterized by GC and AG dinucleotides at their 5′- and 3′-termini, respectively. Most current gene prediction software, whether based on ab initio or spliced alignment approaches, does not include explicit models for non-canonical introns or may exclude their prediction altogether. With present amounts of genome and transcript data, it is now possible to apply statistical methodology to non-canonical splice site prediction. We pursued one such approach and describe the training and implementation of GC-donor splice site models for Arabidopsis and rice, with the goal of exploring whether specific modeling of non-canonical introns can enhance gene structure prediction accuracy.
Results: Our results indicate that the incorporation of non-canonical splice site models yields dramatic improvements in annotating genes containing GC–AG and AT–AC non-canonical introns. Comparison of models shows differences between monocot and dicot species, but also suggests GC intron-specific biases independent of taxonomic clade. We also present evidence that GC–AG introns occur preferentially in genes with atypically high exon counts.
Availability: Source code for the updated versions of GeneSeqer and SplicePredictor (distributed with the GeneSeqer code) is available at . Web servers for Arabidopsis, rice and other plant species are accessible at , and , respectively. A SplicePredictor web server is available at . Software to generate training data and parameterizations for Bayesian splice site models is available at
Contact: vbrendel@iastate.edu
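The intron classes named in the abstract (canonical GT–AG, and the non-canonical GC–AG and AT–AC subtypes) are defined purely by terminal dinucleotides, so they can be illustrated with a minimal Python sketch. The function name `classify_intron` and the toy sequences are hypothetical and are not part of GeneSeqer or SplicePredictor; the sketch only labels an intron by its ends and does not implement the Bayesian splice site models described in the paper.

```python
def classify_intron(intron: str) -> str:
    """Label an intron by its 5' (donor) and 3' (acceptor) terminal dinucleotides."""
    seq = intron.upper()
    if len(seq) < 4:
        raise ValueError("intron too short to have two terminal dinucleotides")
    donor, acceptor = seq[:2], seq[-2:]
    if (donor, acceptor) == ("GT", "AG"):
        return "canonical GT-AG"
    if (donor, acceptor) == ("GC", "AG"):
        return "non-canonical GC-AG"
    if (donor, acceptor) == ("AT", "AC"):
        return "non-canonical AT-AC"
    return f"other ({donor}-{acceptor})"


# Toy sequences for illustration only (not real plant introns).
examples = ["GTAAGTTTTTCAG", "GCAAGTTTTTCAG", "ATATCCTTTTTAC"]
labels = [classify_intron(s) for s in examples]
```

A dinucleotide check like this cannot by itself predict splice sites, since GC and AG occur throughout any genome; that is why the paper trains probability models on the sequence context around GC donor sites.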