FlowerPower: clustering proteins into domain architecture classes for phylogenomic inference of protein function

Open Access

8 February 2007

journal article
conference paper
Published by Springer Nature in BMC Ecology and Evolution

Vol. 7 (S1) , S12
https://doi.org/10.1186/1471-2148-7-s1-s12

Abstract

Background: Function prediction by transfer of annotation from the top database hit in a homology search has been shown to be prone to systematic error. Phylogenomic analysis reduces these errors by inferring protein function within the evolutionary context of the entire family. However, accuracy of function prediction for multi-domain proteins depends on all members having the same overall domain structure. By contrast, most common homolog detection methods are optimized for retrieving local homologs, and do not address this requirement.

Results: We present FlowerPower, a novel clustering algorithm designed for the identification of global homologs as a precursor to structural phylogenomic analysis. Like PSI-BLAST and similar methods, FlowerPower clusters sequences iteratively. However, rather than using a single HMM or profile to expand the cluster, FlowerPower identifies subfamilies using the SCI-PHY algorithm and then selects and aligns new homologs using subfamily hidden Markov models. FlowerPower is shown to outperform BLAST, PSI-BLAST and the UCSC SAM-Target 2K methods at discriminating between proteins in the same domain architecture class and those having different overall domain structures.

Conclusion: Structural phylogenomic analysis enables biologists to avoid the systematic errors associated with annotation transfer; clustering sequences based on sharing the same domain architecture is a critical first step in this process. FlowerPower is shown to consistently identify homologous sequences having the same domain architecture as the query.

Availability: FlowerPower is available as a webserver at http://phylogenomics.berkeley.edu/flowerpower/.
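The iterative, subfamily-based expansion described in the abstract can be illustrated with a small toy sketch. This is not the FlowerPower implementation: k-mer sets stand in for subfamily HMMs, a greedy grouping stands in for SCI-PHY, and a length-ratio check stands in for the requirement that new members share the query's overall domain structure. All function names, thresholds and sequences below are hypothetical and chosen only for illustration.

```python
from typing import List, Set

K = 3  # k-mer size for the toy similarity measure (assumption)


def kmers(seq: str) -> Set[str]:
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + K] for i in range(len(seq) - K + 1)}


def containment(seq: str, model: Set[str]) -> float:
    """Fraction of the sequence's k-mers found in a model (toy stand-in for an HMM score)."""
    ks = kmers(seq)
    return len(ks & model) / len(ks) if ks else 0.0


def subfamily_models(cluster: List[str], split_at: float = 0.9) -> List[Set[str]]:
    """Greedily group cluster members into subfamilies (toy stand-in for SCI-PHY).

    A member joins the first subfamily it matches at >= split_at,
    otherwise it founds a new subfamily. Each model is a k-mer set.
    """
    models: List[Set[str]] = []
    for seq in cluster:
        for model in models:
            if containment(seq, model) >= split_at:
                model |= kmers(seq)  # in-place update of the matched subfamily model
                break
        else:
            models.append(kmers(seq))
    return models


def is_global_match(seq: str, query: str, min_len_ratio: float = 0.7) -> bool:
    """Crude proxy for 'same domain architecture': similar overall length to the query."""
    return min(len(seq), len(query)) / max(len(seq), len(query)) >= min_len_ratio


def expand_cluster(query: str, database: List[str],
                   accept_at: float = 0.6, max_rounds: int = 10) -> List[str]:
    """Iteratively add database sequences accepted by at least one subfamily model."""
    cluster = [query]
    remaining = [s for s in database if s != query]
    for _ in range(max_rounds):
        models = subfamily_models(cluster)  # rebuilt once per round
        accepted = [
            s for s in remaining
            if is_global_match(s, query)
            and max(containment(s, m) for m in models) >= accept_at
        ]
        if not accepted:
            break  # converged: no new members this round
        cluster.extend(accepted)
        remaining = [s for s in remaining if s not in accepted]
    return cluster


if __name__ == "__main__":
    query = "ACDEFGHIKLMNPQRSTVWY"
    close = "ACDEFGHIKLANPQRSTVWY"            # one substitution from the query
    distant = "ACDWFGHIKLANPQRSAVWY"          # closer to `close` than to the query
    multi_domain = query + "GGSGGSGGSGGSGGSGGSGG"  # query plus an extra region
    fragment = "FGHIKLMN"                     # local match only
    print(expand_cluster(query, [distant, multi_domain, close, fragment]))
```

In this toy data, `distant` is not accepted in the first round but becomes reachable once `close` has founded its own subfamily model, which is the kind of behaviour an iterative subfamily-based search is meant to provide. The longer `multi_domain` sequence and the short `fragment` are excluded by the length check even though they match the query locally.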
