Statistical Evaluation of Local Alignment Features Predicting Allergenicity Using Supervised Classification Algorithms
- 1 February 2004
- journal article
- research article
- Published by S. Karger AG in International Archives of Allergy and Immunology
- Vol. 133 (2) , 101-112
- https://doi.org/10.1159/000076382
Abstract
Recently, two promising alignment-based features predicting food allergenicity using the k nearest neighbor (kNN) classifier were reported. These features are the alignment score and alignment length of the best local alignment obtained in a database of known allergen sequences. In the work reported here, a much more comprehensive statistical evaluation of the potential of these features was performed, this time for the prediction of allergenicity in general. The evaluation consisted of the following four key components. (1) A new high-quality database consisting of 318 carefully selected, non-redundant allergens and 1,007 sequences carefully selected to be non-allergens. (2) Three different supervised algorithms: the kNN classifier, the Bayesian linear Gaussian classifier, and the Bayesian quadratic Gaussian classifier. (3) A large set of local alignment procedures defined using the FASTA3 alignment program by means of a wide range of different parameter settings. (4) Novel performance curves, alternative to conventional receiver-operating characteristic curves, to display not only average behaviors but also statistical variations due to small data sets. The linear Gaussian classifier proved most useful among the tested supervised machine learning algorithms, closely followed by the quadratic Gaussian equivalent and kNN. The overall best classification results were obtained with a novel feature vector consisting of the combined alignment scores derived from local alignment procedures using different substitution matrices. The models reported here should be useful as a part of an integrated assessment scheme for potential protein allergenicity and for future comparisons with alternative bioinformatic approaches.
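To make the classification setup in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It shows how a two-dimensional feature vector (alignment score, alignment length) could be fed to a kNN classifier and to a linear Gaussian classifier with a shared covariance matrix. The toy data values, the function names, the choice of k = 3, the Euclidean distance and the equal class priors are all assumptions made for illustration; none of them come from the paper.

```python
import math

# Hypothetical toy data (invented values): each pair is the
# (alignment score, alignment length) of a sequence's best local hit
# against a database of known allergens.
ALLERGENS = [(310.0, 120.0), (275.0, 98.0), (420.0, 150.0), (198.0, 80.0)]
NON_ALLERGENS = [(45.0, 30.0), (62.0, 41.0), (38.0, 25.0), (90.0, 55.0)]


def knn_predict(x, k=3):
    """Return 1 (allergen) or 0 (non-allergen) by majority vote of the k nearest points."""
    labelled = [(p, 1) for p in ALLERGENS] + [(p, 0) for p in NON_ALLERGENS]
    nearest = sorted(labelled, key=lambda item: math.dist(item[0], x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0


def mean(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def pooled_covariance(groups):
    """Pooled (shared) 2x2 covariance, returned as (sxx, sxy, syy)."""
    sxx = sxy = syy = 0.0
    dof = 0
    for pts in groups:
        mx, my = mean(pts)
        for px, py in pts:
            sxx += (px - mx) ** 2
            sxy += (px - mx) * (py - my)
            syy += (py - my) ** 2
        dof += len(pts) - 1
    return sxx / dof, sxy / dof, syy / dof


def linear_gaussian_predict(x):
    """Gaussian classifier with a shared covariance and equal priors (assumed)."""
    sxx, sxy, syy = pooled_covariance([ALLERGENS, NON_ALLERGENS])
    det = sxx * syy - sxy * sxy
    # Closed-form inverse of the 2x2 covariance matrix.
    inv = ((syy / det, -sxy / det), (-sxy / det, sxx / det))

    def score(mu):
        # Linear discriminant: x^T S^-1 mu - 0.5 * mu^T S^-1 mu
        w0 = inv[0][0] * mu[0] + inv[0][1] * mu[1]
        w1 = inv[1][0] * mu[0] + inv[1][1] * mu[1]
        return w0 * x[0] + w1 * x[1] - 0.5 * (w0 * mu[0] + w1 * mu[1])

    return 1 if score(mean(ALLERGENS)) > score(mean(NON_ALLERGENS)) else 0


if __name__ == "__main__":
    query = (300.0, 110.0)  # hypothetical query sequence features
    print("kNN:", knn_predict(query))
    print("linear Gaussian:", linear_gaussian_predict(query))
```

A quadratic Gaussian classifier would differ only in estimating a separate covariance matrix per class, which makes the decision boundary quadratic rather than linear. In practice the two features would likely need rescaling before a distance-based method such as kNN is applied, since alignment scores and lengths are on different scales.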