Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions
Open Access
- 24 February 2009
- journal article
- research article
- Published in Proceedings of the National Academy of Sciences
- Vol. 106 (8), 2677-2682
- https://doi.org/10.1073/pnas.0813249106
Abstract
For comparison of whole-genome (genic + nongenic) sequences, multiple sequence alignment of a few selected genes is not appropriate. One approach is to use an alignment-free method in which feature (or l-mer) frequency profiles (FFP) of whole genomes are used for comparison, a variation of a text or book comparison method that uses word frequency profiles. In this approach it is critical to identify the optimal resolution range of l-mers for the set of genomes being compared. The optimum FFP method is applicable for comparing whole genomes or large genomic regions even when there are no common genes with high homology. We outline the method in 3 stages: (i) We first show how the optimal resolution range can be determined with English books that have been transformed into long character strings by removing all punctuation and spaces. (ii) Next, we test the robustness of the optimized FFP method at the nucleotide level, using a mutation model with a wide range of base substitutions and rearrangements. (iii) Finally, to illustrate the utility of the method, phylogenies are reconstructed from concatenated mammalian intronic genomes; the FFP-derived intronic genome topologies for each l within the optimal range are all very similar. The topology agrees with the established mammalian phylogeny, revealing that intron regions contain a similar level of phylogenic signal as do coding regions.
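The profile comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`normalize_text`, `ffp`, `js_divergence`) are hypothetical, and the choice of Jensen-Shannon divergence as the distance between profiles is an assumption: the abstract does not name a distance measure. The sketch strips a text to a bare character string as in stage (i), counts overlapping l-mers, normalizes the counts to frequencies, and compares two profiles.

```python
import math
import string
from collections import Counter


def normalize_text(text):
    """Lowercase the text and keep only ASCII letters (drops spaces and punctuation)."""
    return "".join(ch for ch in text.lower() if ch in string.ascii_lowercase)


def ffp(sequence, l):
    """Return the relative frequency of every overlapping l-mer in `sequence`."""
    if l < 1 or len(sequence) < l:
        raise ValueError("sequence must contain at least l characters")
    counts = Counter(sequence[i:i + l] for i in range(len(sequence) - l + 1))
    total = sum(counts.values())
    return {feature: n / total for feature, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two sparse profiles.

    Assumption: JS divergence is used here only as one reasonable way to
    compare two frequency profiles; it ranges from 0 (identical) to 1 (disjoint).
    """
    div = 0.0
    for feature in set(p) | set(q):
        pi = p.get(feature, 0.0)
        qi = q.get(feature, 0.0)
        mi = 0.5 * (pi + qi)  # mixture profile
        if pi > 0:
            div += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            div += 0.5 * qi * math.log2(qi / mi)
    return div


if __name__ == "__main__":
    # Toy inputs; real use would load whole books or genome sequences.
    a = normalize_text("It was the best of times, it was the worst of times.")
    b = normalize_text("It is a truth universally acknowledged, that a single man...")
    for l in range(1, 6):
        print(l, js_divergence(ffp(a, l), ffp(b, l)))
```

Scanning a range of l values, as in the loop above, is only a starting point for exploring resolution; the procedure the paper uses to determine the optimal range is not reproduced here. For nucleotide sequences the same functions apply to strings over A, C, G, T, although this sketch does not treat reverse-complement l-mers specially.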