Will my protein crystallize? A sequence‐based predictor
- 28 November 2005
- journal article
- research article
- Published by Wiley in Proteins-Structure Function and Bioinformatics
- Vol. 62 (2) , 343-355
- https://doi.org/10.1002/prot.20789
Abstract
We propose a machine‐learning approach to sequence‐based prediction of protein crystallizability in which we exploit subtle differences between proteins whose structures were solved by X‐ray analysis [or by both X‐ray and nuclear magnetic resonance (NMR) spectroscopy] and those proteins whose structures were solved by NMR spectroscopy alone. Because the NMR technique is usually applied to relatively small proteins, sequence length distributions of the X‐ray and NMR datasets were adjusted to avoid predictions biased by protein size. As feature space for classification, we used frequencies of mono‐, di‐, and tripeptides represented by the original 20‐letter amino acid alphabet as well as by several reduced alphabets in which amino acids were grouped by their physicochemical and structural properties. The classification algorithm was constructed as a two‐layered structure in which the output of primary support vector machine classifiers operating on peptide frequencies was combined by a second‐level Naive Bayes classifier. Due to the application of metamethods for cost sensitivity, our method is able to handle real datasets with unbalanced class representation. An overall prediction accuracy of 67% [65% on the positive (crystallizable) and 69% on the negative (noncrystallizable) class] was achieved in a 10‐fold cross‐validation experiment, indicating that the proposed algorithm may be a valuable tool for more efficient target selection in structural genomics. A Web server for protein crystallizability prediction called SECRET is available at http://webclu.bio.wzw.tum.de:8080/secret. Proteins 2006.
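The feature space described in the abstract (mono‐, di‐, and tripeptide frequencies over a reduced amino acid alphabet) can be illustrated with a minimal Python sketch. The five‐group alphabet below and the function names `reduce_sequence`, `peptide_frequencies`, and `feature_vector` are assumptions made for illustration only; the abstract does not specify the groupings actually used, and this is not the SECRET implementation.

```python
from collections import Counter
from itertools import product

# Hypothetical reduced alphabet: these groups are illustrative only,
# not the groupings used by the authors.
REDUCED_GROUPS = {
    "h": "AVLIMC",  # aliphatic / hydrophobic
    "a": "FWYH",    # aromatic
    "p": "STNQ",    # polar, uncharged
    "c": "KRDE",    # charged
    "s": "GP",      # special backbone conformations
}
AA_TO_GROUP = {aa: g for g, members in REDUCED_GROUPS.items() for aa in members}


def reduce_sequence(seq):
    """Map each residue to its group letter; unknown residues are dropped."""
    return "".join(AA_TO_GROUP[aa] for aa in seq.upper() if aa in AA_TO_GROUP)


def peptide_frequencies(seq, k, alphabet):
    """Relative frequency of every possible k-mer, using overlapping windows."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {
        "".join(p): (counts["".join(p)] / total if total else 0.0)
        for p in product(alphabet, repeat=k)
    }


def feature_vector(seq, max_k=3):
    """Concatenate mono-, di-, and tripeptide frequencies of the reduced sequence."""
    reduced = reduce_sequence(seq)
    alphabet = sorted(REDUCED_GROUPS)
    features = {}
    for k in range(1, max_k + 1):
        features.update(peptide_frequencies(reduced, k, alphabet))
    return features
```

With a five‐letter alphabet this yields 5 + 25 + 125 = 155 features per sequence, compared with 20 + 400 + 8000 for the full 20‐letter alphabet, which is one plausible motivation for grouping residues before counting tripeptides. In a two‐layer setup such as the one described, each (alphabet, k) combination would feed its own primary classifier.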