Accurate prediction of solvent accessibility using neural networks–based regression

20 May 2004

journal article
research article
Published by Wiley in Proteins-Structure Function and Bioinformatics

Vol. 56 (4), 753-767
https://doi.org/10.1002/prot.20176

Abstract

Accurate prediction of relative solvent accessibilities (RSAs) of amino acid residues in proteins may be used to facilitate protein structure prediction and functional annotation. Toward that goal, we developed a novel method for improved prediction of RSAs. Contrary to other machine learning–based methods from the literature, we do not impose a classification problem with arbitrary boundaries between the classes. Instead, we seek a continuous approximation of the real‐value RSA using nonlinear regression, with several feed‐forward and recurrent neural networks, which are then combined into a consensus predictor. A set of 860 protein structures derived from the PFAM database was used for training, whereas validation of the results was carefully performed on several nonredundant control sets comprising a total of 603 structures that were derived from new Protein Data Bank structures and had no homology to proteins included in the training set. Two classes of alternative predictors were developed for comparison with the regression‐based approach: one based on the standard classification approach and the other based on a semicontinuous approximation with the so‐called thermometer encoding. Furthermore, a weighted approximation, with errors being scaled by the observed levels of variability in RSA for equivalent residues in families of homologous structures, was applied in order to improve the results. The effects of including evolutionary profiles and the growth of sequence databases were assessed. In accord with the observed levels of variability in RSA for different ranges of RSA values, the regression accuracy is higher for buried than for exposed residues, with overall mean absolute errors of 15.3–15.8% and correlation coefficients between the predicted and experimental values of 0.64–0.67 on different control sets.
The new method outperforms classification‐based algorithms when the real value predictions are projected onto two‐class classification problems with several commonly used thresholds to separate exposed and buried residues. For example, classification accuracy of about 77% is consistently achieved on all control sets with a threshold of 25% RSA. A web server that enables RSA prediction using the new method and provides customizable graphical representation of the results is available at http://sable.cchmc.org. Proteins 2004.
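
The evaluation measures named in the abstract (mean absolute error, correlation coefficient, and two‐class accuracy after projecting real‐value predictions onto a threshold such as 25% RSA) can be illustrated with a minimal Python sketch. It is not the authors' code. The function names, the convention that a residue counts as exposed when its RSA is at or above the threshold, and the equal‐width bins in the thermometer encoding are all assumptions made for illustration; the abstract does not specify them.

```python
import math


def mean_absolute_error(predicted, observed):
    """Mean absolute difference between predicted and observed RSA (in %)."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


def pearson_correlation(predicted, observed):
    """Pearson correlation coefficient between predicted and observed RSA."""
    n = len(observed)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov / math.sqrt(var_p * var_o)


def two_class_accuracy(predicted, observed, threshold=25.0):
    """Project real-value RSA onto buried/exposed classes and score agreement.

    Assumption: a residue is 'exposed' when RSA >= threshold.
    """
    hits = sum((p >= threshold) == (o >= threshold)
               for p, o in zip(predicted, observed))
    return hits / len(observed)


def thermometer_encode(rsa, n_bins=10):
    """Generic thermometer code: unit k is on when RSA exceeds k * bin width.

    Assumption: equal-width bins over 0-100% RSA; n_bins is hypothetical.
    """
    step = 100.0 / n_bins
    return [1 if rsa > k * step else 0 for k in range(n_bins)]


if __name__ == "__main__":
    # Toy values chosen for illustration only, not data from the paper.
    predicted = [10.0, 40.0, 60.0, 5.0]
    observed = [20.0, 30.0, 70.0, 0.0]
    print(mean_absolute_error(predicted, observed))
    print(pearson_correlation(predicted, observed))
    print(two_class_accuracy(predicted, observed, threshold=25.0))
    print(thermometer_encode(60.0, n_bins=4))
```

In a thermometer code, a higher RSA switches on more units, which is why it sits between a hard two‐class label and a fully continuous target.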
