Steady progress and recent breakthroughs in the accuracy of automated genome annotation
- 1 January 2008
- journal article
- review article
- Published by Springer Nature in Nature Reviews Genetics
- Vol. 9 (1), 62-73
- https://doi.org/10.1038/nrg2220
Abstract
It is not currently possible to determine the precise structure of every protein-coding gene in a complex, eukaryotic genome. However, the past 10 years have seen steady progress in the accuracy and completeness of methods for automated genome annotation. Currently, the gold standard in the annotation of exon–intron structures is the alignment of a full-length cDNA sequence to the sequence of the genomic region from which it was transcribed. For a significant fraction of genes, it is not practical to obtain full-length cDNA sequences by sequencing randomly selected cDNA clones or by screening clone libraries. Some of these genes can be accurately annotated by aligning the sequence of a cDNA (or its translation) to a very similar genomic region other than the one from which it was transcribed. The first driver of recent improvements in annotation is the sequencing of many genomes that can be compared with one another, a trend that is likely to continue. A second source of improvement is the development of better probability models for de novo gene prediction, most recently those based on the conditional random field modelling framework. A third significant source of improvement in mammalian genome annotation has been the development of software for automatically detecting processed pseudogenes. By designing PCR primers for predicted cDNA sequences, it is possible to specifically amplify and sequence thousands of cDNAs, the sequences of which could not be obtained by traditional methods. By using a combiner program to adjudicate among predictions and alignments produced by several methods, one can now come closer than ever before to producing complete and accurate gene catalogues.