Bayesian classifiers for detecting HGT using fixed and variable order Markov models of genomic signatures
Open Access
- 10 January 2006
- journal article
- research article
- Published by Oxford University Press (OUP) in Bioinformatics
- Vol. 22 (5), 517-522
- https://doi.org/10.1093/bioinformatics/btk029
Abstract
Motivation: Analyses of genomic signatures are gaining attention because they allow studies of species-specific relationships without aligning homologous sequences. A naïve Bayesian classifier was built to discriminate between bacteria by their composition of short oligomers, also known as DNA words. The classifier has proven successful in identifying foreign genes in Neisseria meningitidis. In this study we extend the classifier approach using either a fixed higher order Markov model (Mk) or a variable length Markov model (VLMk).
Results: We propose a simple algorithm to lock a variable length Markov model to a certain number of parameters and show that the use of Markov models greatly increases the flexibility and accuracy of prediction compared with a naïve model. We also test the integrity of the classifiers in terms of false negatives and give estimates of the minimal sizes of training data. We end the report by proposing a method to reject a false hypothesis of horizontal gene transfer.
Availability: Software and Supplementary information available at
Contact: dalevi@cs.chalmers.se
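To make the fixed-order case (Mk) concrete, the following is a minimal sketch of how a sequence fragment could be assigned to one of several genomes by comparing log-likelihoods under per-genome order-k Markov models. It is an illustration only, not the authors' software: the function names (train_markov, log_likelihood, classify), the add-one pseudocount, the uniform prior over genomes, and the toy training sequences are all assumptions, and the variable length model (VLMk) is not covered.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(seq, k, pseudocount=1.0):
    """Estimate log P(base | previous k bases) from one training sequence.

    The pseudocount (an assumption of this sketch) keeps unseen
    transitions from getting zero probability.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for i in range(k, len(seq)):
        window = seq[i - k:i + 1]
        # Skip windows containing ambiguous symbols such as N.
        if all(c in ALPHABET for c in window):
            counts[window[:k]][window[k]] += 1.0

    def log_prob(context, base):
        row = counts.get(context, {})
        total = sum(row.values()) + pseudocount * len(ALPHABET)
        return math.log((row.get(base, 0.0) + pseudocount) / total)

    return log_prob


def log_likelihood(model, seq, k):
    """Sum the log transition probabilities along the fragment."""
    total = 0.0
    for i in range(k, len(seq)):
        window = seq[i - k:i + 1]
        if all(c in ALPHABET for c in window):
            total += model(window[:k], window[k])
    return total


def classify(fragment, models, k):
    """Return the best-scoring genome name and all scores.

    With a uniform prior over genomes (assumed here), the maximum
    posterior class is the one with the highest log-likelihood.
    """
    scores = {name: log_likelihood(m, fragment, k) for name, m in models.items()}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    k = 2
    # Toy stand-ins for two genomes with very different word composition.
    models = {
        "genome_a": train_markov("AT" * 200, k),
        "genome_b": train_markov("GC" * 200, k),
    }
    best, scores = classify("ATATATATAT", models, k)
    print(best, scores)
```

In a horizontal gene transfer setting, a gene that scores markedly better under another genome's model than under its host's model would be a candidate foreign gene; where to set that threshold is left open in this sketch.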