Gene prediction and verification in a compact genome with numerous small introns
Open Access
- 12 October 2004
- journal article
- Published by Cold Spring Harbor Laboratory in Genome Research
- Vol. 14 (11) , 2330-2335
- https://doi.org/10.1101/gr.2816704
Abstract
The genomes of clusters of related eukaryotes are now being sequenced at an increasing rate, creating a need for accurate, low-cost annotation of exon–intron structures. In this paper, we demonstrate that reverse transcription-polymerase chain reaction (RT–PCR) and direct sequencing based on predicted gene structures satisfy this need, at least for single-celled eukaryotes. The TWINSCAN gene prediction algorithm was adapted for the fungal pathogen Cryptococcus neoformans by using a precise model of intron lengths in combination with ungapped alignments between the genome sequences of the two closely related Cryptococcus varieties. This approach resulted in ∼60% of known genes being predicted exactly right at every coding base and splice site. When previously unannotated TWINSCAN predictions were tested by RT–PCR and direct sequencing, 75% of targets spanning two predicted introns were amplified and produced high-quality sequence. When targets spanning the complete predicted open reading frame were tested, 72% of them amplified and produced high-quality sequence. We conclude that sequencing a small number of expressed sequence tags (ESTs) to provide training data, running TWINSCAN on an entire genome, and then performing RT–PCR and direct sequencing on all of its predictions would be a cost-effective method for obtaining an experimentally verified genome annotation.
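As an illustration of the "predicted exactly right at every coding base and splice site" criterion, the following is a minimal sketch of how such an exact-match rate could be computed. It is not the authors' evaluation code: the function name `exact_gene_sensitivity`, the coordinate representation, and the toy data are assumptions made for this example, and strand and chromosome are ignored for brevity.

```python
from typing import Dict, List, Tuple

# Assumption for this sketch: a coding exon is an inclusive (start, end) pair
# on a single sequence; strand and chromosome are deliberately ignored.
Exon = Tuple[int, int]


def exact_gene_sensitivity(known: Dict[str, List[Exon]],
                           predicted: List[List[Exon]]) -> float:
    """Fraction of known genes reproduced exactly by some prediction.

    A known gene counts as a hit only if one prediction has the identical
    set of coding exons, so every coding base and every splice site
    (exon boundary) must agree.
    """
    if not known:
        return 0.0
    predicted_structures = {tuple(sorted(exons)) for exons in predicted}
    hits = sum(1 for exons in known.values()
               if tuple(sorted(exons)) in predicted_structures)
    return hits / len(known)


# Hypothetical toy data, not taken from the paper.
known = {
    "geneA": [(100, 250), (310, 480)],
    "geneB": [(900, 1100)],
}
predicted = [
    [(100, 250), (310, 480)],  # identical to geneA: exact match
    [(900, 1090)],             # one boundary differs from geneB: not exact
]

print(exact_gene_sensitivity(known, predicted))  # 0.5
```

Under this strict definition a single misplaced splice site makes the whole gene count as wrong, which is why an exact-match rate near 60% is a demanding measure.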