Protein docking using surface matching and supervised machine learning

19 April 2007

journal article
research article
Published by Wiley in Proteins: Structure, Function, and Bioinformatics

Vol. 68 (2), 488-502
https://doi.org/10.1002/prot.21406

Abstract

Computational prediction of protein complex structures through docking offers a means to gain a mechanistic understanding of the protein interactions that mediate biological processes. This is particularly important because the number of experimentally determined structures of isolated proteins exceeds the number of structures of complexes. A comprehensive docking procedure is described in which conformations are sampled efficiently by matching surface normal vectors, filtered rapidly for shape complementarity, clustered by RMSD, and scored using a supervised machine learning approach. Contacting residue pair frequencies, residue propensities, evolutionary conservation, and the shape complementarity score for each docking conformation are used as input data to a Random Forest classifier. The performance of the Random Forest approach for selecting correctly docked conformations was assessed by cross‐validation using a nonredundant benchmark set of X‐ray structures for 93 heterodimer and 733 homodimer complexes. The single highest-ranked docking solution was the correct (near‐native) structure for slightly more than one third of the complexes. Furthermore, for almost all complexes, the fraction of correct structures among the highly ranked solutions was significantly higher than the overall fraction of correct structures. A detailed analysis of the difficult-to-predict complexes revealed that the majority of the homodimer cases were explained by incorrect oligomeric state annotation. Evolutionary conservation and the shape complementarity score, as well as both underrepresented and overrepresented residue types and residue pairs, were found to make the largest contributions to the overall prediction accuracy. Finally, the method was also applied to docking unbound subunit structures from a previously published benchmark set. Proteins 2007.
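The "clustering by RMSD" step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which clustering algorithm or cutoff the authors used, so the greedy leader clustering below is a swapped-in simplification, not the published method. It also assumes that all poses are already expressed in a common receptor frame, so no superposition is performed before computing RMSD. The names `rmsd`, `cluster_by_rmsd`, and `cutoff`, and the toy coordinates, are hypothetical.

```python
import math


def rmsd(a, b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z) points.

    Assumption: both coordinate sets are in the same reference frame,
    so no optimal superposition is applied.
    """
    if len(a) != len(b) or not a:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    squared = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(squared / len(a))


def cluster_by_rmsd(poses, cutoff):
    """Greedy leader clustering (illustrative stand-in for the paper's method).

    Each pose joins the first cluster whose representative (its first member)
    lies within `cutoff`; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of pose indices.
    """
    clusters = []
    for i, pose in enumerate(poses):
        for members in clusters:
            if rmsd(poses[members[0]], pose) <= cutoff:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Toy ligand poses with two atoms each (hypothetical coordinates).
poses = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.1, 0.0, 0.0), (1.1, 0.0, 0.0)],
    [(5.0, 5.0, 5.0), (6.0, 5.0, 5.0)],
]
clusters = cluster_by_rmsd(poses, cutoff=2.0)
print(clusters)
```

In a pipeline of the kind the abstract describes, one representative per cluster would then be passed on to the scoring stage, which reduces the number of near-duplicate conformations the classifier has to rank.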
