Inferring Function Using Patterns of Native Disorder in Proteins

Open Access

24 August 2007

journal article
research article
Published by Public Library of Science (PLoS) in PLoS Computational Biology

Vol. 3 (8) , e162
https://doi.org/10.1371/journal.pcbi.0030162

Abstract

Natively unstructured regions are a common feature of eukaryotic proteomes. Between 30% and 60% of proteins are predicted to contain long stretches of disordered residues, and not only have many of these regions been confirmed experimentally, but they have also been found to be essential for protein function. In this study, we directly address the potential contribution of protein disorder in predicting protein function using standard Gene Ontology (GO) categories. Initially we analyse the occurrence of protein disorder in the human proteome and report ontology categories that are enriched in disordered proteins. Pattern analysis of the distributions of disordered regions in human sequences demonstrated that the functions of intrinsically disordered proteins are both length- and position-dependent. These dependencies were then encoded in feature vectors to quantify the contribution of disorder in human protein function prediction using Support Vector Machine classifiers. The prediction accuracies of 26 GO categories relating to signalling and molecular recognition are improved using the disorder features. The most significant improvements were observed for kinase, phosphorylation, growth factor, and helicase categories. Furthermore, we provide predicted GO term assignments using these classifiers for a set of unannotated and orphan human proteins. In this study, the importance of capturing protein disorder information and its value in function prediction is demonstrated. The GO category classifiers generated can be used to provide more reliable predictions and further insights into the behaviour of orphan and unannotated proteins.

As a result of high throughput sequencing technologies, there is a growing need to provide fast and accurate computational tools to predict the function of proteins from amino acid sequence. Most methods that attempt to do this rely on transferring function annotations between closely related proteins; however, a large proportion of unannotated proteins are orphans and do not share sufficient similarity to other proteins to be annotated in this way. Methods that target the annotation of these difficult proteins are feature-based methods and utilise relationships between the physical characteristics of proteins and function to make predictions. One important characteristic of proteins that remains unexploited in these feature-based methods is native structural disorder. Disordered regions of proteins are thought to adopt little or no regular structure and have been experimentally linked with the correct functioning of many proteins. Additionally, disordered regions of proteins can be successfully predicted from amino acid sequence. To address the requirement for protein function prediction methods that target the annotation of orphan proteins and explore the use of information describing protein disorder, a machine learning method for predicting protein function from sequence has been implemented. The inclusion of disorder features significantly improves prediction accuracies for many function categories relating to molecular recognition. The practical utility of the method is also demonstrated by providing annotations for a set of orphan and unannotated human proteins.
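The abstract states that length- and position-dependent patterns of disorder were encoded in feature vectors, but it does not give the encoding itself. The following is a minimal sketch of one way such a vector could be built from per-residue disorder predictions. The function names, the length bin edges, and the number of position bins are all hypothetical choices made for illustration and are not taken from the article.

```python
from typing import List, Optional, Tuple

# Hypothetical length bins (in residues); the upper bound is exclusive and
# None means "no upper bound". These edges are illustrative assumptions.
LENGTH_BINS: List[Tuple[int, Optional[int]]] = [(1, 30), (30, 100), (100, None)]

# Hypothetical number of equal-width position bins along the sequence
# (for example N-terminal, middle, C-terminal).
N_POSITION_BINS = 3


def disordered_regions(flags: List[bool]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index pairs for each run of True flags."""
    regions = []
    start = None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i                      # a disordered run begins here
        elif not flag and start is not None:
            regions.append((start, i))     # the run ended at the previous residue
            start = None
    if start is not None:                  # run extends to the end of the sequence
        regions.append((start, len(flags)))
    return regions


def disorder_features(flags: List[bool]) -> List[int]:
    """Count disordered regions per (length bin, position bin) combination."""
    n = len(flags)
    features = [0] * (len(LENGTH_BINS) * N_POSITION_BINS)
    if n == 0:
        return features
    for start, end in disordered_regions(flags):
        length = end - start
        midpoint = (start + end - 1) / 2   # centre of the region, in residue indices
        pos_bin = min(int(midpoint / n * N_POSITION_BINS), N_POSITION_BINS - 1)
        for len_bin, (low, high) in enumerate(LENGTH_BINS):
            if length >= low and (high is None or length < high):
                features[len_bin * N_POSITION_BINS + pos_bin] += 1
                break
    return features
```

Under these assumptions, each protein is reduced to a fixed-length vector of region counts, which is the kind of input a Support Vector Machine classifier for a given GO category could be trained on alongside other sequence-derived features.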
