Domain Architecture Comparison for Multidomain Homology Identification
- 1 May 2007
- journal article
- research article
- Published by Mary Ann Liebert Inc in Journal of Computational Biology
- Vol. 14 (4) , 496-516
- https://doi.org/10.1089/cmb.2007.a009
Abstract
Homology identification is the first step for many genomic studies. Current methods, based on sequence comparison, can result in a substantial number of mis-assignments due to the similarity of homologous domains in otherwise unrelated sequences. Here we propose methods to detect homologs through explicit comparison of protein domain content. We developed several schemes for scoring the homology of a pair of protein sequences based on methods used in the field of information retrieval. We evaluate the proposed methods and methods used in the literature using a benchmark of fifteen sequence families of known evolutionary history. The results of these studies demonstrate the effectiveness of comparing domain architectures using these similarity measures. We also demonstrate the importance of both weighting promiscuous domains and compensating for the statistical effect of having a large number of domains in a protein. Using logistic regression, we demonstrate the benefit of combining similarity measures based on domain content with sequence similarity measures.
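The abstract does not give the scoring formulas, so the following is only a minimal sketch of one information-retrieval style measure that fits its description: domains are weighted by inverse document frequency (so promiscuous domains count less), and cosine normalization offsets the effect of a protein having many domains. The function names, the choice of IDF and cosine, and the toy architectures are assumptions made for illustration, not the authors' published schemes.

```python
import math
from collections import Counter


def idf_weights(architectures):
    """Assumed weighting: log(N / number of proteins containing the domain).

    A domain found in many proteins (a promiscuous domain) gets a low weight;
    a domain found in every protein gets weight 0.
    """
    n = len(architectures)
    doc_freq = Counter(d for arch in architectures for d in set(arch))
    return {d: math.log(n / doc_freq[d]) for d in doc_freq}


def weighted_cosine(arch_a, arch_b, idf):
    """Cosine similarity of two domain multisets under the given weights.

    Dividing by the vector norms compensates for proteins that simply
    contain more domains.
    """
    vec_a = {d: c * idf.get(d, 0.0) for d, c in Counter(arch_a).items()}
    vec_b = {d: c * idf.get(d, 0.0) for d, c in Counter(arch_b).items()}
    dot = sum(w * vec_b.get(d, 0.0) for d, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy architectures (illustrative only, not benchmark data).
proteins = {
    "P1": ["Kinase", "SH2", "SH3"],
    "P2": ["Kinase", "SH2"],
    "P3": ["SH3", "PDZ"],
    "P4": ["SH3", "C2"],
}
idf = idf_weights(list(proteins.values()))
sim_p1_p2 = weighted_cosine(proteins["P1"], proteins["P2"], idf)
sim_p1_p3 = weighted_cosine(proteins["P1"], proteins["P3"], idf)
print(f"P1 vs P2: {sim_p1_p2:.3f}")
print(f"P1 vs P3: {sim_p1_p3:.3f}")
```

In this toy set, P1 and P3 share only SH3, the most widespread domain, so their score is lower than that of P1 and P2, which share two less common domains. A score of this kind could then be used alongside a sequence similarity score as inputs to a logistic regression, in the spirit of the combination the abstract describes.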