PREDICTING THE SUB-CELLULAR LOCATION OF PROTEINS FROM TEXT USING SUPPORT VECTOR MACHINES
- 1 December 2001
- proceedings article
- Published by World Scientific Pub Co Pte Ltd
Abstract
We present an automatic method to classify the sub-cellular location of proteins based on the text of relevant MEDLINE abstracts. For each protein, a vector of terms is generated from the MEDLINE abstracts in which the name of the protein or gene, or one of its synonyms, occurs. A Support Vector Machine (SVM) is used to automatically partition the term space and thus to discriminate the textual features that define sub-cellular location. The method is benchmarked on a set of proteins of known sub-cellular location from S. cerevisiae. Neither prior knowledge of the problem domain nor any natural language processing is used at any stage. The method outperforms support vector machines trained on amino acid composition and performs comparably to rule-based text classifiers. Combining text with protein amino acid composition improves recall for some sub-cellular locations. We discuss the generality of the method and its potential application to a variety of biological classification problems.
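The pipeline described in the abstract (pool the abstracts that mention a protein into a term vector, then let an SVM separate the term space) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy abstracts, the two-class nucleus versus mitochondrion setup, the plain word-count features, and the linear SVM trained by Pegasos-style subgradient descent on the hinge loss. The abstract does not state which kernel, term weighting, or training software the authors used.

```python
import re
from collections import Counter


def term_vector(abstracts):
    """Pool word counts over all abstracts that mention one protein."""
    counts = Counter()
    for text in abstracts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


def score(weights, vector):
    """Linear decision value w . x for a sparse term vector."""
    return sum(weights.get(term, 0.0) * count for term, count in vector.items())


def train_linear_svm(vectors, labels, epochs=50, lam=0.01):
    """Pegasos-style subgradient descent on the L2-regularised hinge loss.

    Labels must be +1 or -1. Examples are visited in a fixed order, so the
    result is deterministic. No bias term is fitted (a simplification).
    """
    weights = {}
    step = 0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            step += 1
            eta = 1.0 / (lam * step)
            margin = y * score(weights, x)
            # Gradient of the regulariser: shrink every weight.
            shrink = 1.0 - eta * lam
            for term in weights:
                weights[term] *= shrink
            # Gradient of the hinge loss: only for margin violations.
            if margin < 1.0:
                for term, count in x.items():
                    weights[term] = weights.get(term, 0.0) + eta * y * count
    return weights


def predict(weights, vector):
    """Return +1 (nucleus) or -1 (mitochondrion) in this toy setup."""
    return 1 if score(weights, vector) > 0 else -1


# Hypothetical toy corpus: each entry is the list of abstracts for one protein.
training_abstracts = [
    ["histone binding in the nucleus", "regulates transcription of chromatin"],
    ["respiration in the mitochondria", "inner membrane of mitochondria"],
    ["nucleus import and transcription factor"],
    ["mitochondria respiration complex"],
]
training_labels = [1, -1, 1, -1]  # +1 = nucleus, -1 = mitochondrion

training_vectors = [term_vector(a) for a in training_abstracts]
weights = train_linear_svm(training_vectors, training_labels)

nuclear_query = term_vector(["chromatin remodelling requires histone"])
mito_query = term_vector(["defective respiration affects mitochondria"])
print(predict(weights, nuclear_query), predict(weights, mito_query))
```

The benchmark in the abstract covers several sub-cellular locations, so a realistic version would train one such classifier per location (one-vs-rest) or use a multi-class SVM. Combining text with amino acid composition, as the abstract mentions, could be approximated by appending composition features to each term vector, though the abstract does not say how the authors combined them.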