Specificity prediction of adenylation domains in nonribosomal peptide synthetases (NRPS) using transductive support vector machines (TSVMs)
Open Access
- 12 October 2005
- journal article
- research article
- Published by Oxford University Press (OUP) in Nucleic Acids Research
- Vol. 33 (18), 5799-5808
- https://doi.org/10.1093/nar/gki885
Abstract
We present a new support vector machine (SVM)-based approach to predict the substrate specificity of subtypes of a given protein sequence family. We demonstrate the usefulness of this method on the example of aryl acid-activating and amino acid-activating adenylation domains (A domains) of nonribosomal peptide synthetases (NRPS). The residues of gramicidin synthetase A that lie within 8 Å of the substrate amino acid, together with the corresponding positions of other adenylation domain sequences with 397 known and unknown specificities, were extracted, and this physico-chemical fingerprint was encoded into normalized real-valued feature vectors based on the physico-chemical properties of the amino acids. The SVM software package SVMlight was used for training and classification, with transductive SVMs to take advantage of the information inherent in unlabeled data. Specificities for very similar substrates that frequently show cross-specificities were pooled into so-called composite specificities, and predictive models were built for them. The reliability of the models was confirmed in cross-validations and in comparison with a currently used sequence-comparison-based method. When comparing the predictions for 1230 NRPS A domains that are currently detectable in UniProt, the new method was able to give a specificity prediction in an additional 18% of the cases compared with the old method. For 70% of the sequences both methods agreed; for <6% they did not, mainly on low-confidence predictions by the existing method. Neither of the predictive methods could infer any specificity for 2.4% of the sequences, suggesting completely new types of specificity.
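The encoding step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes two stand-in descriptors (the Kyte-Doolittle hydropathy scale and a simple formal side-chain charge), z-score normalization across the 20 amino acids, and a zero vector for gaps or unknown residues; the abstract does not say which properties or normalization were actually used. The function names and the example signatures are hypothetical. The output follows the SVMlight input format, in which a target of 0 marks an unlabeled example for transductive learning.

```python
from statistics import mean, pstdev

# Kyte-Doolittle hydropathy values (stand-in descriptor, an assumption).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Formal side-chain charge near neutral pH (second stand-in descriptor).
CHARGE = {aa: 0.0 for aa in HYDROPATHY}
CHARGE.update({"D": -1.0, "E": -1.0, "K": 1.0, "R": 1.0})


def zscore(table):
    """Normalize one property to zero mean and unit variance over all residues."""
    values = list(table.values())
    mu, sigma = mean(values), pstdev(values)
    return {aa: (v - mu) / sigma for aa, v in table.items()}


PROPERTIES = [zscore(HYDROPATHY), zscore(CHARGE)]


def encode(signature):
    """Turn a string of binding-pocket residues into a flat feature vector.

    Each position contributes one value per property. Gaps ('-') and
    unknown symbols map to 0.0, the mean after z-scoring (an assumption).
    """
    vec = []
    for res in signature.upper():
        for table in PROPERTIES:
            vec.append(table.get(res, 0.0))
    return vec


def svmlight_line(label, vec):
    """Format one example as '<target> <index>:<value> ...' for SVMlight.

    label is +1 or -1 for known specificities and 0 for unlabeled
    sequences that a transductive SVM should classify.
    """
    if label not in (1, -1, 0):
        raise ValueError("label must be 1, -1 or 0")
    target = "0" if label == 0 else f"{label:+d}"
    feats = " ".join(f"{i}:{v:.4f}" for i, v in enumerate(vec, start=1))
    return f"{target} {feats}"


# Hypothetical pocket signatures: one positive, one negative, one unlabeled.
examples = [("DAWTIA", 1), ("DVQLIA", -1), ("D-WTIA", 0)]
lines = [svmlight_line(label, encode(sig)) for sig, label in examples]
```

Writing `lines` to a text file, one example per line, gives a training file of the kind SVMlight reads. For composite specificities, sequences whose substrates were pooled would simply share the same positive label in the file for that model.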