Predicting DNA-binding sites of proteins from amino acid sequence
Open Access
- 19 May 2006
- journal article
- research article
- Published by Springer Nature in BMC Bioinformatics
- Vol. 7 (1) , 262
- https://doi.org/10.1186/1471-2105-7-262
Abstract
Understanding the molecular details of protein-DNA interactions is critical for deciphering the mechanisms of gene regulation. We present a machine learning approach for the identification of amino acid residues involved in protein-DNA interactions.
We start with a Naïve Bayes classifier trained to predict whether a given amino acid residue is a DNA-binding residue based on its identity and the identities of its sequence neighbors. The input to the classifier consists of the identities of the target residue and 4 sequence neighbors on each side of the target residue. The classifier is trained and evaluated (using leave-one-out cross-validation) on a non-redundant set of 171 proteins. Our results indicate the feasibility of identifying interface residues based on local sequence information. The classifier achieves 71% overall accuracy with a correlation coefficient of 0.24, 35% specificity and 53% sensitivity in identifying interface residues as evaluated by leave-one-out cross-validation.
We show that the performance of the classifier is improved by using the sequence entropy of the target residue (the entropy of the corresponding column in the multiple alignment obtained by aligning the target sequence with its sequence homologs) as additional input. With this additional input, the classifier achieves 78% overall accuracy with a correlation coefficient of 0.28, 44% specificity and 41% sensitivity in identifying interface residues.
Examination of the predictions in the context of the 3-dimensional structures of the proteins demonstrates the effectiveness of this method in identifying DNA-binding sites from sequence information. In 33% (56 out of 171) of the proteins, the classifier identifies the interaction sites by correctly recognizing at least half of the interface residues. In 87% (149 out of 171) of the proteins, the classifier correctly identifies at least 20% of the interface residues.
This suggests the possibility of using such classifiers to identify potential DNA-binding motifs and to gain potentially useful insights into sequence correlates of protein-DNA interactions. Naïve Bayes classifiers trained to identify DNA-binding residues using sequence information offer a computationally efficient approach to identifying putative DNA-binding sites in DNA-binding proteins and recognizing potential DNA-binding motifs.
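The pipeline described in the abstract (a 9-residue window, a Naïve Bayes classifier, leave-one-out evaluation, and column entropy from an alignment) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not specify the padding symbol, the smoothing scheme, the decision threshold, whether hold-out is per protein, or how entropy is fed to the classifier, so `PAD`, the add-one pseudocount, the zero log-odds threshold, and per-protein hold-out are assumptions made here. All function and class names are hypothetical, and the "correlation coefficient" is assumed to be the Matthews correlation coefficient.

```python
import math
from collections import Counter, defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "-"  # assumed symbol for window positions that fall off the sequence ends


def windows(seq, k=4):
    """Return one window per residue: the residue plus k neighbors on each side."""
    padded = PAD * k + seq + PAD * k
    return [padded[i:i + 2 * k + 1] for i in range(len(seq))]


class WindowNaiveBayes:
    """Naive Bayes over window positions, with add-one smoothing (an assumption)."""

    def __init__(self, pseudocount=1.0):
        self.pc = pseudocount
        self.n_symbols = len(AMINO_ACIDS) + 1  # 20 residues plus PAD
        self.class_counts = Counter()
        self.counts = defaultdict(Counter)  # (label, position) -> symbol counts

    def fit(self, examples):
        """examples: iterable of (window, label), label 1 = DNA-binding residue."""
        for window, label in examples:
            self.class_counts[label] += 1
            for pos, aa in enumerate(window):
                self.counts[(label, pos)][aa] += 1
        return self

    def _log_joint(self, window, label):
        # log P(label) + sum over positions of log P(symbol at position | label)
        total = sum(self.class_counts.values())
        n = self.class_counts[label]
        lp = math.log((n + self.pc) / (total + 2 * self.pc))
        for pos, aa in enumerate(window):
            c = self.counts[(label, pos)][aa]
            lp += math.log((c + self.pc) / (n + self.pc * self.n_symbols))
        return lp

    def predict(self, window, threshold=0.0):
        """Predict 1 when the log-odds of binding exceed the threshold."""
        log_odds = self._log_joint(window, 1) - self._log_joint(window, 0)
        return int(log_odds > threshold)


def column_entropy(column):
    """Shannon entropy (bits) of one alignment column, given as a string."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def leave_one_protein_out(proteins, k=4):
    """proteins: list of (sequence, labels), labels a '0'/'1' string per residue.

    Each protein is held out once; the model is trained on all the others.
    Returns pooled (tp, fp, tn, fn) counts over all held-out residues.
    """
    tally = Counter()
    for i, (seq, labels) in enumerate(proteins):
        train = [(w, int(l))
                 for j, (s, ls) in enumerate(proteins) if j != i
                 for w, l in zip(windows(s, k), ls)]
        model = WindowNaiveBayes().fit(train)
        for w, l in zip(windows(seq, k), labels):
            tally[(int(l), model.predict(w))] += 1
    tp, fn = tally[(1, 1)], tally[(1, 0)]
    tn, fp = tally[(0, 0)], tally[(0, 1)]
    return tp, fp, tn, fn
```

From the pooled counts, overall accuracy is `(tp + tn) / (tp + fp + tn + fn)` and sensitivity is `tp / (tp + fn)`. The abstract does not state how specificity is defined, so it is left out of the sketch. `column_entropy` is shown on its own because the abstract does not say how the entropy value is encoded as a classifier input.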