A low-polynomial algorithm for assembling clusters of orthologous groups from intergenomic symmetric best matches

Open Access

2 May 2010

journal article
research article
Published by Oxford University Press (OUP) in Bioinformatics

Vol. 26 (12) , 1481-1487
https://doi.org/10.1093/bioinformatics/btq229

Abstract

Motivation: Identifying orthologous genes in multiple genomes is a fundamental task in comparative genomics. Construction of intergenomic symmetrical best matches (SymBets) and joining them into clusters is a popular method of ortholog definition, embodied in several software programs. Despite their wide use, the computational complexity of these programs has not been thoroughly examined.

Results: In this work, we show that in the standard approach of iteration through all triangles of SymBets, the memory scales with at least the number of these triangles, O(g³) (where g = number of genomes), and construction time scales with the iteration through each pair, i.e. O(g⁶). We propose the EdgeSearch algorithm that iterates over edges in the SymBet graph rather than triangles of SymBets, and as a result has a worst-case complexity of only O(g³log g). Several optimizations reduce the run-time even further in realistically sparse graphs. In two real-world datasets of genomes from bacteriophages (POGs) and Mollicutes (MOGs), an implementation of the EdgeSearch algorithm runs about an order of magnitude faster than the original algorithm and scales much better with increasing number of genomes, with only minor differences in the final results, and up to 60 times faster than the popular OrthoMCL program with a 90% overlap between the identified groups of orthologs.

Availability and implementation: C++ source code freely available for download at ftp.ncbi.nih.gov/pub/wolf/COGs/COGsoft/

Contact: dmk@stowers.org

Supplementary information: Supplementary materials are available at Bioinformatics online.
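As an illustration of the terms used in the abstract, the following minimal Python sketch builds SymBets from a table of best hits, finds triangles by iterating over edges of the SymBet graph, and merges triangles that share a side. It is not the authors' EdgeSearch implementation and includes none of its optimizations. The function names, the `(genome, gene_id)` gene representation, and the toy `best_hit` table are all assumptions made for this example.

```python
from collections import defaultdict

# Assumption: a gene is a (genome, gene_id) tuple, and best_hit maps
# (gene, target_genome) -> the best-matching gene in that target genome.

def symbets(best_hit):
    """Return symmetric best matches as a set of frozenset({a, b}) edges."""
    edges = set()
    for (gene, _target), hit in best_hit.items():
        # Keep the pair only if the best hit points back at the query gene.
        if best_hit.get((hit, gene[0])) == gene:
            edges.add(frozenset((gene, hit)))
    return edges

def triangles_by_edge(edges):
    """Find triangles by walking edges and intersecting neighbor sets."""
    nbrs = defaultdict(set)
    for e in edges:
        a, b = tuple(e)
        nbrs[a].add(b)
        nbrs[b].add(a)
    tris = set()
    for e in edges:
        a, b = tuple(e)
        for c in nbrs[a] & nbrs[b]:  # c closes the triangle a-b-c
            tris.add(frozenset((a, b, c)))
    return tris

def merge_triangles(tris):
    """Merge triangles that share a side, using union-find over edges."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for t in tris:
        a, b, c = sorted(t)
        e1, e2, e3 = frozenset((a, b)), frozenset((b, c)), frozenset((a, c))
        parent[find(e1)] = find(e2)
        parent[find(e2)] = find(e3)

    clusters = defaultdict(set)
    for e in list(parent):
        clusters[find(e)] |= e  # collect genes from every edge in the group
    return sorted(sorted(c) for c in clusters.values())

# Hypothetical toy data: four genomes A-D, one gene each, plus a second
# gene d2 in genome D that breaks reciprocity between genomes A and D.
a1, b1, c1 = ("A", "a1"), ("B", "b1"), ("C", "c1")
d1, d2 = ("D", "d1"), ("D", "d2")
best_hit = {
    (a1, "B"): b1, (b1, "A"): a1,
    (a1, "C"): c1, (c1, "A"): a1,
    (b1, "C"): c1, (c1, "B"): b1,
    (b1, "D"): d1, (d1, "B"): b1,
    (c1, "D"): d1, (d1, "C"): c1,
    (d1, "A"): a1, (a1, "D"): d2,  # not reciprocal, so no a1-d1 SymBet
}

edges = symbets(best_hit)
tris = triangles_by_edge(edges)
clusters = merge_triangles(tris)
print(clusters)
```

In this toy input the two triangles share the side b1-c1, so they merge into a single cluster of four genes even though a1 and d1 are not themselves a SymBet. Running union-find over edges rather than over genes means that triangles touching only at a vertex stay separate.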
