Homology Induction: the use of machine learning to improve sequence similarity searches
Open Access
- 23 April 2002
- journal article
- research article
- Published by Springer Nature in BMC Bioinformatics
- Vol. 3 (1), 11
- https://doi.org/10.1186/1471-2105-3-11
Abstract
The inference of homology between proteins is a key problem in molecular biology. The current best approaches only identify approximately 50% of homologies (with a false positive rate set at 1/1000). We present Homology Induction (HI), a new approach to inferring homology. HI uses machine learning to bootstrap from standard sequence similarity search methods. First a standard method is run; then HI learns rules that are true for sequences of high similarity to the target (assumed homologues) and not true for general sequences. These rules are then used to discriminate sequences in the twilight zone. To learn the rules, HI describes the sequences in a novel way based on a bioinformatic knowledge base and uses the machine learning method of inductive logic programming. To evaluate HI we used the PDB40D benchmark, which lists sequences of known homology but low sequence similarity. We compared the HI methodology with PSI-BLAST alone and found HI performed significantly better. In addition, Receiver Operating Characteristic (ROC) curve analysis showed that these improvements were robust for all reasonable error costs. The predictive homology rules learnt by HI can be interpreted biologically to provide insight into conserved features of homologous protein families. HI is a new technique for the detection of remote protein homology, a central bioinformatic problem. HI with PSI-BLAST is shown to outperform PSI-BLAST for all error costs. It is expected that similar improvements would be obtained using HI with any sequence similarity method.
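The bootstrapping workflow described in the abstract (search, label the confident hits as assumed homologues, learn discriminating rules, apply them to twilight-zone hits) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The `Hit` record, the E-value cut-offs, the feature strings, and the coverage thresholds are all hypothetical, and a simple feature-coverage filter stands in for inductive logic programming, which learns far richer relational rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    seq_id: str
    e_value: float        # score reported by the initial similarity search
    features: frozenset   # hypothetical symbolic descriptors of the sequence


def partition_hits(hits, confident_e=1e-6, twilight_e=10.0):
    """Split hits into assumed homologues and twilight-zone candidates.

    Both cut-offs are illustrative assumptions, not values from the paper.
    """
    positives = [h for h in hits if h.e_value <= confident_e]
    twilight = [h for h in hits if confident_e < h.e_value <= twilight_e]
    return positives, twilight


def learn_rule(positives, background, min_pos_cover=0.8, max_neg_cover=0.2):
    """Stand-in rule learner: keep features that are common among the
    assumed homologues and rare among general (background) sequences."""
    if not positives:
        return set()
    candidates = set().union(*(h.features for h in positives))
    rule = set()
    for feature in candidates:
        pos_cover = sum(feature in h.features for h in positives) / len(positives)
        neg_cover = (
            sum(feature in b.features for b in background) / len(background)
            if background
            else 0.0
        )
        if pos_cover >= min_pos_cover and neg_cover <= max_neg_cover:
            rule.add(feature)
    return rule


def classify_twilight(twilight, rule):
    """Predict as homologous any twilight-zone hit with a learnt feature."""
    return [h.seq_id for h in twilight if rule & h.features]


# Toy data, invented purely for illustration.
hits = [
    Hit("a", 1e-30, frozenset({"kw:globin", "ss:all_alpha"})),
    Hit("b", 1e-12, frozenset({"kw:globin", "ss:all_alpha", "org:plant"})),
    Hit("c", 0.5, frozenset({"kw:globin", "ss:all_alpha"})),
    Hit("d", 3.0, frozenset({"kw:kinase", "ss:all_beta"})),
]
background = [
    Hit("v", 100.0, frozenset({"kw:kinase"})),
    Hit("w", 100.0, frozenset({"ss:all_alpha"})),
    Hit("x", 100.0, frozenset({"ss:all_beta"})),
    Hit("y", 100.0, frozenset({"ss:all_alpha", "kw:kinase"})),
    Hit("z", 100.0, frozenset({"org:plant"})),
]

positives, twilight = partition_hits(hits)
rule = learn_rule(positives, background)
predicted = classify_twilight(twilight, rule)
print(sorted(rule), predicted)
```

In this toy run, only features shared by the confident hits and rare in the background survive as the rule. Twilight-zone hits are then accepted or rejected by that rule rather than by their similarity score alone, which is the general idea the abstract describes.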