Accurate prediction of solvent accessibility using neural networks–based regression
- 20 May 2004
- journal article
- research article
- Published by Wiley in Proteins-Structure Function and Bioinformatics
- Vol. 56 (4) , 753-767
- https://doi.org/10.1002/prot.20176
Abstract
Accurate prediction of relative solvent accessibilities (RSAs) of amino acid residues in proteins may be used to facilitate protein structure prediction and functional annotation. Toward that goal, we developed a novel method for improved prediction of RSAs. Contrary to other machine learning–based methods from the literature, we do not impose a classification problem with arbitrary boundaries between the classes. Instead, we seek a continuous approximation of the real‐value RSA using nonlinear regression, with several feed‐forward and recurrent neural networks, which are then combined into a consensus predictor. A set of 860 protein structures derived from the PFAM database was used for training, whereas validation of the results was carefully performed on several nonredundant control sets comprising a total of 603 structures that were derived from new Protein Data Bank structures and had no homology to proteins included in the training set. Two classes of alternative predictors were developed for comparison with the regression‐based approach: one based on the standard classification approach and the other based on a semicontinuous approximation with the so‐called thermometer encoding. Furthermore, a weighted approximation, with errors being scaled by the observed levels of variability in RSA for equivalent residues in families of homologous structures, was applied in order to improve the results. The effects of including evolutionary profiles and the growth of sequence databases were assessed. In accord with the observed levels of variability in RSA for different ranges of RSA values, the regression accuracy is higher for buried than for exposed residues, with overall mean absolute errors of 15.3–15.8% and correlation coefficients between the predicted and experimental values of 0.64–0.67 on different control sets.
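The thermometer encoding mentioned above can be illustrated with a minimal sketch. The abstract does not give the number of output units or the threshold placement, so the ten evenly spaced thresholds, the function names `thermometer_encode` and `thermometer_decode`, and the RSA scale of 0 to 1 are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def thermometer_encode(rsa, n_units=10):
    """Encode RSA values in [0, 1] as cumulative binary vectors.

    Unit k is switched on when the value reaches the k-th threshold,
    so larger values light up a longer run of leading ones.
    The evenly spaced thresholds are an assumption of this sketch.
    """
    rsa = np.asarray(rsa, dtype=float)
    thresholds = np.arange(1, n_units + 1) / n_units  # 0.1, 0.2, ..., 1.0
    return (rsa[:, None] >= thresholds[None, :]).astype(int)

def thermometer_decode(codes):
    """Map thermometer codes back to a semicontinuous RSA estimate."""
    codes = np.asarray(codes)
    return codes.sum(axis=1) / codes.shape[1]

# Toy values chosen only for illustration.
codes = thermometer_encode([0.05, 0.35, 1.0])
print(codes)
print(thermometer_decode(codes))
```

Decoding recovers the value only up to the spacing between thresholds, which is why such an encoding is semicontinuous rather than a true real-valued regression target.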
The new method outperforms classification‐based algorithms when the real value predictions are projected onto two‐class classification problems with several commonly used thresholds to separate exposed and buried residues. For example, classification accuracy of about 77% is consistently achieved on all control sets with a threshold of 25% RSA. A web server that enables RSA prediction using the new method and provides customizable graphical representation of the results is available at http://sable.cchmc.org. Proteins 2004.
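The evaluation measures named in the abstract (mean absolute error, correlation coefficient, and two‐class accuracy after thresholding) can be sketched as follows. This is an illustrative sketch only: the function name `evaluate_rsa`, the use of the Pearson coefficient, the convention that a residue at exactly the threshold counts as exposed, and the toy values are assumptions, not details taken from the paper.

```python
import numpy as np

def evaluate_rsa(pred, obs, threshold=25.0):
    """Score real-valued RSA predictions (in percent) against observed RSA.

    Returns the mean absolute error, the Pearson correlation, and the
    two-class accuracy obtained by projecting both predicted and observed
    values onto buried/exposed classes at the given threshold.
    """
    pred = np.asarray(pred, dtype=float)
    obs = np.asarray(obs, dtype=float)
    mae = np.mean(np.abs(pred - obs))
    corr = np.corrcoef(pred, obs)[0, 1]
    # Assumption: RSA >= threshold means "exposed".
    accuracy = np.mean((pred >= threshold) == (obs >= threshold))
    return mae, corr, accuracy

# Toy values for illustration only, not data from the study.
observed = [5.0, 20.0, 30.0, 60.0, 80.0]
predicted = [10.0, 30.0, 20.0, 50.0, 70.0]
mae, corr, acc = evaluate_rsa(predicted, observed, threshold=25.0)
print(mae, corr, acc)
```

Because the regression output is continuous, the same predictions can be scored at any other commonly used threshold by changing the `threshold` argument, without retraining a separate classifier.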