PREDICTING THE SUB-CELLULAR LOCATION OF PROTEINS FROM TEXT USING SUPPORT VECTOR MACHINES

1 December 2001

proceedings article
Published by World Scientific Pub Co Pte Ltd

p. 374-85
https://doi.org/10.1142/9789812799623_0035

Abstract

We present an automatic method for classifying proteins by sub-cellular location based on the text of relevant MEDLINE abstracts. For each protein, a vector of terms is generated from the MEDLINE abstracts in which the name of the protein or gene, or one of its synonyms, occurs. A Support Vector Machine (SVM) is used to partition the term space automatically and thus to discriminate the textual features that define sub-cellular location. The method is benchmarked on a set of proteins of known sub-cellular location from S. cerevisiae. Neither prior knowledge of the problem domain nor natural language processing is used at any stage. The method outperforms SVMs trained on amino acid composition and performs comparably to rule-based text classifiers. Combining text with amino acid composition improves recall for some sub-cellular locations. We discuss the generality of the method and its potential application to a variety of biological classification problems.
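
The pipeline described in the abstract (pool the abstracts that mention a protein, turn them into a term vector, train an SVM on those vectors) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the protein names and sentences are invented toy data, the tokenizer is a plain regular expression, the classifier is a linear SVM without a bias term trained by a Pegasos-style sub-gradient method, and only one binary decision (nuclear versus not nuclear) is shown. The kernel, term weighting and software actually used by the authors are not stated in the abstract.

```python
import re
from collections import Counter


def term_vector(abstracts):
    """Pool raw term counts over all abstracts that mention one protein."""
    counts = Counter()
    for text in abstracts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


def score(weights, vector):
    """Signed decision value of a sparse term vector under a linear model."""
    return sum(weights.get(term, 0.0) * n for term, n in vector.items())


def train_linear_svm(vectors, labels, epochs=200, lam=0.01):
    """Pegasos-style sub-gradient descent on the regularised hinge loss.

    labels are +1 / -1; examples are visited in a fixed cyclic order.
    """
    weights = {}
    t = 0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            t += 1
            eta = 1.0 / (lam * t)
            margin = y * score(weights, x)  # computed before the update
            for term in weights:  # shrinkage from the L2 regulariser
                weights[term] *= 1.0 - eta * lam
            if margin < 1.0:  # hinge loss is active: move towards y * x
                for term, n in x.items():
                    weights[term] = weights.get(term, 0.0) + eta * y * n
    return weights


# Hypothetical toy corpus: protein name -> abstracts mentioning it.
corpus = [
    ("PROT_A", ["PROT_A is a nuclear transcription factor.",
                "PROT_A binds chromatin in the nucleus."], +1),
    ("PROT_B", ["Nuclear import of PROT_B requires importin."], +1),
    ("PROT_C", ["PROT_C localizes to the mitochondrial inner membrane."], -1),
    ("PROT_D", ["Loss of PROT_D impairs mitochondrial respiration."], -1),
]

vectors = [term_vector(texts) for _, texts, _ in corpus]
labels = [label for _, _, label in corpus]
weights = train_linear_svm(vectors, labels)

# A positive score means "nuclear", a negative score means "not nuclear".
for query in ("nuclear chromatin", "mitochondrial membrane"):
    print(query, score(weights, term_vector([query])))
```

Covering several sub-cellular locations would need one such binary classifier per location (one-versus-rest) or a multi-class SVM; which of these the authors used is not stated in the abstract.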
