Substring selection for biomedical document classification
Open Access
- 23 June 2006
- journal article
- research article
- Published by Oxford University Press (OUP) in Bioinformatics
- Vol. 22 (17) , 2136-2142
- https://doi.org/10.1093/bioinformatics/btl350
Abstract
Motivation: Attribute selection is a critical step in development of document classification systems. As a standard practice, words are stemmed and the most informative ones are used as attributes in classification. Owing to high complexity of biomedical terminology, general-purpose stemming algorithms are often conservative and could also remove informative stems. This can lead to accuracy reduction, especially when the number of labeled documents is small. To address this issue, we propose an algorithm that omits stemming and, instead, uses the most discriminative substrings as attributes.
Results: The approach was tested on five annotated sets of abstracts from iProLINK that report on the experimental evidence about five types of protein post-translational modifications. The experiments showed that Naive Bayes and support vector machine classifiers perform consistently better [with area under the ROC curve (AUC) accuracy in range 0.92–0.97] when using the proposed attribute selection than when using attributes obtained by the Porter stemmer algorithm (AUC in 0.86–0.93 range). The proposed approach is particularly useful when labeled datasets are small.
Contact: vucetic@ist.temple.edu
Supplementary Information: The supplementary data are available from
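The idea of using discriminative substrings instead of stemmed words as attributes can be illustrated with a minimal sketch. The abstract does not say how substrings are scored or which lengths are considered, so everything specific here is an assumption made for illustration: the use of information gain over document-level presence or absence, the `min_len`/`max_len` bounds, and the function names `substrings`, `entropy`, and `select_substrings`. It is not the authors' algorithm.

```python
import math
from collections import Counter


def substrings(text, min_len=3, max_len=6):
    """Return the set of distinct character substrings of the lowercased text.

    The length bounds are hypothetical parameters, not taken from the paper.
    """
    text = text.lower()
    return {
        text[i:i + n]
        for n in range(min_len, max_len + 1)
        for i in range(len(text) - n + 1)
    }


def entropy(pos, neg):
    """Entropy in bits of a two-class split with the given document counts."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def select_substrings(docs, labels, k=10, min_len=3, max_len=6):
    """Rank substrings by information gain (an assumed criterion) and keep k.

    docs   -- list of document strings
    labels -- list of 0/1 class labels, one per document
    """
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos

    # Document frequency of each substring within each class.
    pos_df, neg_df = Counter(), Counter()
    for doc, label in zip(docs, labels):
        target = pos_df if label else neg_df
        target.update(substrings(doc, min_len, max_len))

    base = entropy(n_pos, n_neg)
    scores = {}
    for s in set(pos_df) | set(neg_df):
        a, b = pos_df[s], neg_df[s]      # documents containing s
        c, d = n_pos - a, n_neg - b      # documents lacking s
        conditional = ((a + b) / n) * entropy(a, b) + ((c + d) / n) * entropy(c, d)
        scores[s] = base - conditional

    # Highest gain first; ties broken alphabetically for a stable result.
    return sorted(scores, key=lambda s: (-scores[s], s))[:k]


# Toy usage on made-up sentences (not data from the paper).
docs = [
    "phosphorylation of serine",
    "phosphorylated tyrosine residue",
    "gene expression profile",
    "expression of the gene",
]
labels = [1, 1, 0, 0]
selected = select_substrings(docs, labels, k=5)
```

In this toy setting a substring such as `phos` covers both "phosphorylation" and "phosphorylated" without any stemming step. The selected substrings could then serve as binary attributes for a Naive Bayes or support vector machine classifier.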