Beyond the Twilight Zone: Automated prediction of structural properties of proteins by recursive neural networks and remote homology information
- 24 March 2009
- journal article
- research article
- Published by Wiley in Proteins-Structure Function and Bioinformatics
- Vol. 77 (1) , 181-190
- https://doi.org/10.1002/prot.22429
Abstract
The prediction of 1D structural properties of proteins is an important step toward the prediction of protein structure and function, not only in the ab initio case but also when homology information to known structures is available. Despite this, the vast majority of 1D predictors do not incorporate homology information into the prediction process. We develop a novel structural alignment method, SAMD, which we use to build alignments of putative remote homologues that we compress into templates of structural frequency profiles. We use these templates as additional input to ensembles of recursive neural networks, which we specialise for the prediction of query sequences that show only remote homology to any Protein Data Bank structure. We predict four 1D structural properties: secondary structure, relative solvent accessibility, backbone structural motifs, and contact density. Secondary structure prediction accuracy, tested by five-fold cross-validation on a large set of proteins allowing less than 25% sequence identity both between training and test sets and between query sequences and templates, exceeds 82%, outperforming its ab initio counterpart, other state-of-the-art secondary structure predictors (Jpred 3 and PSIPRED), and two other systems based on PSI-BLAST and COMPASS templates. We show that structural information from homologues improves prediction accuracy well beyond the Twilight Zone of sequence similarity, even below 5% sequence identity, for all four structural properties. Significant improvement over the extraction of structural information directly from PDB templates suggests that the combination of sequence and template information is more informative than templates alone. Proteins 2009.
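The abstract does not say how aligned remote homologues are compressed into a structural frequency profile. As a minimal illustrative sketch, one simple possibility is weighted per-position counting of the templates' secondary structure classes. The function name `structural_frequency_profile`, the three-class alphabet `HEC`, the gap symbol `-`, and the optional per-template weights are all assumptions made for illustration and are not taken from the paper.

```python
# Hypothetical sketch: per-position class frequencies from aligned templates.
# Assumed alphabet: H (helix), E (strand), C (coil); '-' marks "no template residue".
CLASSES = "HEC"


def structural_frequency_profile(aligned_ss, weights=None):
    """Return one [freq_H, freq_E, freq_C] vector per query position.

    aligned_ss: list of strings, one per template, each as long as the query.
    weights:    optional per-template weights (assumed, e.g. alignment quality).
    Positions covered by no template get an all-zero vector.
    """
    if not aligned_ss:
        return []
    length = len(aligned_ss[0])
    if any(len(s) != length for s in aligned_ss):
        raise ValueError("all aligned template strings must have the query length")
    if weights is None:
        weights = [1.0] * len(aligned_ss)
    if len(weights) != len(aligned_ss):
        raise ValueError("need exactly one weight per template")

    profile = []
    for i in range(length):
        counts = {c: 0.0 for c in CLASSES}
        for ss, w in zip(aligned_ss, weights):
            if ss[i] in counts:  # skip gaps and unknown symbols
                counts[ss[i]] += w
        total = sum(counts.values())
        profile.append([counts[c] / total if total > 0 else 0.0 for c in CLASSES])
    return profile


# Toy example: three templates aligned to a four-residue query.
templates = ["HHE-", "HCE-", "HHC-"]
profile = structural_frequency_profile(templates)
```

In a setup like the one the abstract describes, such a vector would be supplied at each residue alongside the sequence-derived input. The all-zero vector for uncovered positions is one assumed way to signal that no template information is available there.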
Funding Information
- Health Research Board of Ireland
- UCD President's Award 2004 (05/RFP/CMS0029, RP/2005/219)
This publication has 65 references indexed in Scilit:
- The Jpred 3 secondary structure prediction server. Nucleic Acids Research, 2008
- Accurate prediction of protein secondary structure and solvent accessibility by consensus combiners of sequence and structure information. BMC Bioinformatics, 2007
- Evaluation of local structure alphabets based on residue burial. Proteins-Structure Function and Bioinformatics, 2004
- Within the twilight zone: a sensitive profile-profile comparison tool based on information theory. Journal of Molecular Biology, 2002
- Pairwise sequence alignment below the twilight zone. Journal of Molecular Biology, 2001
- The Protein Data Bank. Nucleic Acids Research, 2000
- GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. Journal of Molecular Biology, 1999
- Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 1997
- Selection of representative protein data sets. Protein Science, 1992
- Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 1983