A machine learning information retrieval approach to protein fold recognition

Open Access

17 March 2006

journal article
research article
Published by Oxford University Press (OUP) in Bioinformatics

Vol. 22 (12), 1456-1463
https://doi.org/10.1093/bioinformatics/btl102

Abstract

Motivation: Recognizing proteins that have similar tertiary structure is the key step of template-based protein structure prediction methods. Traditionally, a variety of alignment methods are used to identify similar folds, based on sequence similarity and sequence-structure compatibility. Although these methods are complementary, their integration has not been thoroughly exploited. Statistical machine learning methods provide tools for integrating multiple features, but so far these methods have been used primarily for protein and fold classification, rather than for the retrieval problem of fold recognition, that is, finding a proper template for a given query protein.

Results: Here we present a two-stage machine learning information retrieval approach to fold recognition. First, we use alignment methods to derive pairwise similarity features for query-template protein pairs. We also use global profile–profile alignments in combination with predicted secondary structure, relative solvent accessibility, contact map and beta-strand pairing to extract pairwise structural compatibility features. Second, we apply support vector machines to these features to predict the structural relevance (i.e. in the same fold or not) of the query-template pairs. For each query, the continuous relevance scores are used to rank the templates. The FOLDpro approach is modular, scalable and effective. Compared with 11 other fold recognition methods, FOLDpro yields the best results in almost all standard categories on a comprehensive benchmark dataset. Using predictions of the top-ranked template, the sensitivity is approximately 85, 56 and 27% at the family, superfamily and fold levels, respectively. Using the five top-ranked templates, the sensitivity increases to 90, 70 and 48%.

Availability: The FOLDpro server is available with the SCRATCH suite through .

Contact: pfbaldi@ics.uci.edu

Supplementary information: Supplementary data are available at
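The retrieval step described in the Results (rank templates for each query by a continuous relevance score, then measure sensitivity over the top-ranked templates) can be illustrated with a minimal sketch. This is not the FOLDpro implementation: the function names, the toy scores and the exact definition of top-k sensitivity (the fraction of queries with at least one true match among the k top-ranked templates) are assumptions made for illustration, and the scores are invented rather than produced by a support vector machine.

```python
from typing import Dict, List, Tuple

# One scored template for a query: (template_id, relevance_score, is_true_match).
# Hypothetical structure; in the paper the scores come from SVMs.
Scored = Tuple[str, float, bool]


def rank_templates(scored: List[Scored]) -> List[Scored]:
    """Sort the templates of one query by descending relevance score."""
    return sorted(scored, key=lambda t: t[1], reverse=True)


def top_k_sensitivity(queries: Dict[str, List[Scored]], k: int) -> float:
    """Fraction of queries with a true match among the k top-ranked templates.

    Queries that have no true match in the template library are skipped,
    since no ranking could recover one (an assumption of this sketch).
    """
    hits = 0
    total = 0
    for scored in queries.values():
        if not any(is_match for _, _, is_match in scored):
            continue
        total += 1
        if any(is_match for _, _, is_match in rank_templates(scored)[:k]):
            hits += 1
    return hits / total if total else 0.0


# Invented toy data: two queries, three candidate templates each.
toy = {
    "q1": [("t1", 0.9, True), ("t2", 0.2, False), ("t3", -0.5, False)],
    "q2": [("t1", 0.4, False), ("t2", 0.3, True), ("t3", -1.0, False)],
}

print(top_k_sensitivity(toy, 1))
print(top_k_sensitivity(toy, 2))
```

In this toy example the true match for the second query is ranked second, so it counts as a hit only when more than the top-ranked template is considered; this is the same effect that raises the reported sensitivity when five templates are used instead of one.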
