Steady progress and recent breakthroughs in the accuracy of automated genome annotation

1 January 2008

journal article
review article
Published by Springer Nature in Nature Reviews Genetics

Vol. 9 (1) , 62-73
https://doi.org/10.1038/nrg2220

Abstract

It is not currently possible to determine the precise structure of every protein-coding gene in a complex, eukaryotic genome. However, the past 10 years have seen steady progress in the accuracy and completeness of methods for automated genome annotation. Currently, the gold standard in the annotation of exon–intron structures is the alignment of a full-length cDNA sequence to the sequence of the genomic region from which it was transcribed. For a significant fraction of genes, it is not practical to obtain full-length cDNA sequences by sequencing randomly selected cDNA clones or by screening clone libraries. Some of these genes can be accurately annotated by aligning the sequence of a cDNA (or its translation) to a very similar genomic region other than the one from which it was transcribed. The first driver of recent improvements in annotation is the sequencing of many genomes that can be compared with one another, a trend that is likely to continue. A second source of improvement is the development of better probability models for de novo gene prediction, most recently those based on the conditional random field modelling framework. A third significant source of improvement in mammalian genome annotation has been the development of software for automatically detecting processed pseudogenes. By designing PCR primers for predicted cDNA sequences, it is possible to specifically amplify and sequence thousands of cDNAs, the sequences of which could not be obtained by traditional methods. By using a combiner program to adjudicate among predictions and alignments produced by several methods, one can now come closer than ever before to producing complete and accurate gene catalogues.
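The abstract's last point, a combiner program that adjudicates among predictions from several methods, can be illustrated with a deliberately simplified sketch. It assumes exons are represented as (start, end) coordinate pairs and that each evidence source carries a hand-chosen weight. The function name `combine_exons`, the source names, the weights, and the threshold are all hypothetical, and exact-match weighted voting is a stand-in for illustration, not the method of any combiner discussed in the review.

```python
from collections import defaultdict


def combine_exons(predictions, weights, threshold=0.5):
    """Keep each exon whose weighted support is at least `threshold`
    of the total weight of all contributing sources.

    predictions: dict mapping source name -> list of (start, end) exons
    weights:     dict mapping source name -> non-negative weight (assumed values)
    """
    total = sum(weights[source] for source in predictions)
    support = defaultdict(float)
    for source, exons in predictions.items():
        for exon in set(exons):  # a source votes at most once per exon
            support[exon] += weights[source]
    return sorted(e for e, s in support.items() if s / total >= threshold)


# Toy input: three hypothetical evidence sources for one locus.
predictions = {
    "cdna_alignment": [(100, 200), (300, 400)],
    "de_novo": [(100, 200), (300, 410)],
    "protein_alignment": [(100, 200), (300, 400), (500, 560)],
}
# Assumed weights: cDNA alignment is trusted most, de novo prediction least.
weights = {"cdna_alignment": 3.0, "de_novo": 1.0, "protein_alignment": 2.0}

consensus = combine_exons(predictions, weights)
print(consensus)
```

With these assumed weights, only exons backed by at least half of the total weight survive. Lowering `threshold` admits exons supported by fewer or less trusted sources.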
