An unsupervised classification scheme for improving predictions of prokaryotic TIS

Open Access

9 March 2006

journal article
Published by Springer Nature in BMC Bioinformatics

Vol. 7 (1) , 121
https://doi.org/10.1186/1471-2105-7-121

Abstract

Background: Although it is not difficult for state-of-the-art gene finders to identify coding regions in prokaryotic genomes, exact prediction of the corresponding translation initiation sites (TIS) is still a challenging problem. Recently a number of post-processing tools have been proposed for improving the annotation of prokaryotic TIS. However, inherent difficulties of these approaches arise from the considerable variation of TIS characteristics across different species. Therefore prior assumptions about the properties of prokaryotic gene starts may cause suboptimal predictions for newly sequenced genomes with TIS signals differing from those of well-investigated genomes.

Results: We introduce a clustering algorithm for completely unsupervised scoring of potential TIS, based on positionally smoothed probability matrices. The algorithm requires an initial gene prediction and the genomic sequence of the organism to perform the reannotation. As compared with other methods for improving predictions of gene starts in bacterial genomes, our approach is not based on any specific assumptions about prokaryotic TIS. Despite the generality of the underlying algorithm, the prediction rate of our method is competitive on experimentally verified test data from E. coli and B. subtilis. Regarding genomes with high G+C content, in contrast to some previously proposed methods, our algorithm also provides good performance on P. aeruginosa, B. pseudomallei and R. solanacearum.

Conclusion: On reliable test data we showed that our method provides good results in post-processing the predictions of the widely-used program GLIMMER. The underlying clustering algorithm is robust with respect to variations in the initial TIS annotation and does not require specific assumptions about prokaryotic gene starts. These features are particularly useful on genomes with high G+C content. The algorithm has been implemented in the tool »TICO« (TIs COrrector) which is publicly available from our web site.
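
The idea named in the Results, unsupervised scoring of candidate TIS with positionally smoothed probability matrices, can be illustrated with a minimal Python sketch. It is not the TICO implementation: the abstract does not specify the smoothing kernel, the window layout, or the clustering rule, so the three-point kernel, the pseudocount, the two-class relabelling loop, and all function names below are assumptions made purely for illustration. Candidate sites are represented as equal-length sequence windows around potential start codons.

```python
import math

ALPHABET = "ACGT"


def position_frequency_matrix(windows, pseudocount=1.0):
    """Column-wise nucleotide probabilities for equal-length windows."""
    length = len(windows[0])
    matrix = []
    for pos in range(length):
        # The pseudocount keeps every probability strictly positive.
        counts = {base: pseudocount for base in ALPHABET}
        for window in windows:
            counts[window[pos]] += 1.0
        total = sum(counts.values())
        matrix.append({base: counts[base] / total for base in ALPHABET})
    return matrix


def smooth_positions(matrix, weights=(0.25, 0.5, 0.25)):
    """Average each column with its neighbours (assumed smoothing kernel)."""
    half = len(weights) // 2
    smoothed = []
    for pos in range(len(matrix)):
        column = {base: 0.0 for base in ALPHABET}
        norm = 0.0
        for offset, weight in enumerate(weights, start=-half):
            neighbour = pos + offset
            if 0 <= neighbour < len(matrix):
                norm += weight
                for base in ALPHABET:
                    column[base] += weight * matrix[neighbour][base]
        # Renormalise so columns at the window edges still sum to one.
        smoothed.append({base: column[base] / norm for base in ALPHABET})
    return smoothed


def log_odds_score(window, signal, background):
    """Sum of per-position log ratios between two probability matrices."""
    return sum(
        math.log(signal[i][base] / background[i][base])
        for i, base in enumerate(window)
    )


def recluster(candidates, labels, rounds=10):
    """Alternate between fitting two smoothed matrices and relabelling.

    `labels` is an initial guess (True = putative TIS), e.g. taken from an
    existing gene prediction; no species-specific motif is assumed.
    """
    for _ in range(rounds):
        strong = [c for c, is_tis in zip(candidates, labels) if is_tis]
        weak = [c for c, is_tis in zip(candidates, labels) if not is_tis]
        if not strong or not weak:
            break  # one class emptied out; keep the current labelling
        signal = smooth_positions(position_frequency_matrix(strong))
        background = smooth_positions(position_frequency_matrix(weak))
        new_labels = [
            log_odds_score(c, signal, background) > 0 for c in candidates
        ]
        if new_labels == labels:
            break  # labelling is stable
        labels = new_labels
    return labels
```

In this sketch the initial labels play the role of the initial gene prediction mentioned in the abstract, and the loop only refines them from the sequence data itself. Smoothing across neighbouring positions lets a signal whose distance to the start codon varies slightly still contribute to the same matrix columns.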
