Iterative gene prediction and pseudogene removal improves genome annotation
- 1 May 2006
- journal article
- Published by Cold Spring Harbor Laboratory in Genome Research
- Vol. 16 (5), 678-685
- https://doi.org/10.1101/gr.4766206
Abstract
Correct gene prediction is impaired by the presence of processed pseudogenes: nonfunctional, intronless copies of real genes found elsewhere in the genome. Gene prediction programs frequently mistake processed pseudogenes for real genes or exons, leading to biologically irrelevant gene predictions. While methods exist to identify processed pseudogenes in genomes, no attempt has been made to integrate pseudogene removal with gene prediction, or even to provide a freestanding tool that identifies such erroneous gene predictions. We have created PPFINDER (for Processed Pseudogene finder), a program that integrates several methods of processed pseudogene finding in mammalian gene annotations. We used PPFINDER to remove pseudogenes from N-SCAN gene predictions, and show that gene prediction improves substantially when gene prediction and pseudogene masking are interleaved. In addition, we used PPFINDER with gene predictions as a parent database, eliminating the need for libraries of known genes. This allows us to run the gene prediction/PPFINDER procedure on newly sequenced genomes for which few genes are known.
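The interleaving of gene prediction and pseudogene masking described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `predict` and `find_pseudogenes` are hypothetical callables standing in for a gene predictor (such as N-SCAN) and a pseudogene finder (such as PPFINDER), whose real interfaces are not given here. The round limit, the `N`-masking convention, and the toy stand-ins at the bottom are likewise assumptions made only so the loop can be run.

```python
import re
from typing import Callable, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on the sequence


def mask_intervals(sequence: str, intervals: List[Interval]) -> str:
    """Overwrite each interval with 'N' so a predictor can no longer use it."""
    chars = list(sequence)
    for start, end in intervals:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = "N"
    return "".join(chars)


def iterate_prediction(
    sequence: str,
    predict: Callable[[str], List[Interval]],
    find_pseudogenes: Callable[[List[Interval]], List[Interval]],
    max_rounds: int = 5,  # assumed cap; the abstract does not state one
) -> Tuple[List[Interval], str]:
    """Alternate prediction and pseudogene masking until nothing is flagged.

    If max_rounds is reached first, the returned predictions may still
    contain intervals that the finder would flag.
    """
    masked = sequence
    genes = predict(masked)
    for _ in range(max_rounds):
        pseudo = find_pseudogenes(genes)
        if not pseudo:
            break
        masked = mask_intervals(masked, pseudo)
        genes = predict(masked)  # re-predict on the masked sequence
    return genes, masked


# Toy stand-ins (purely illustrative, not biological models):
# the "predictor" reports every literal GENE or PSEUDO token, and the
# "finder" flags any prediction that is exactly six characters long.
def toy_predict(seq: str) -> List[Interval]:
    return [(m.start(), m.end()) for m in re.finditer("GENE|PSEUDO", seq)]


def toy_find_pseudogenes(genes: List[Interval]) -> List[Interval]:
    return [g for g in genes if g[1] - g[0] == 6]


final_genes, final_masked = iterate_prediction(
    "xxGENExxPSEUDOxx", toy_predict, toy_find_pseudogenes
)
```

In this toy run the first round flags the six-character prediction, masks it, and the second round of prediction returns only the remaining interval. With a real predictor, masking can also change neighbouring predictions, which is the reason for re-predicting rather than simply filtering the first result.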