Gene/protein name recognition based on support vector machine using dictionary as features
Open Access
- 24 May 2005
- journal article
- Published by Springer Nature in BMC Bioinformatics
- Vol. 6 (S1) , S8
- https://doi.org/10.1186/1471-2105-6-s1-s8
Abstract
Background: Automated information extraction from biomedical literature is important because a vast amount of biomedical literature has been published. Recognition of the biomedical named entities is the first step in information extraction. We developed an automated recognition system based on the SVM algorithm and evaluated it in Task 1.A of BioCreAtIvE, a competition for automated gene/protein name recognition.

Results: In the work presented here, our recognition system uses the feature set of the word, the part-of-speech (POS), the orthography, the prefix, the suffix, and the preceding class. We call these features "internal resource features", i.e., features that can be found in the training data. Additionally, we consider the features of matching against dictionaries to be external resource features. We investigated and evaluated the effect of these features as well as the effect of tuning the parameters of the SVM algorithm. We found that the dictionary matching features contributed slightly to the improvement in the performance of the f-score. We attribute this to the possibility that the dictionary matching features might overlap with other features in the current multiple feature setting.

Conclusion: During SVM learning, each feature alone had a marginally positive effect on system performance. This supports the fact that the SVM algorithm is robust on the high dimensionality of the feature vector space and means that feature selection is not required.
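The feature set named in the abstract (word, POS, orthography, prefix, suffix, preceding class, and dictionary matching) can be illustrated with a small per-token sketch. This is not the authors' implementation: the abstract does not give the orthographic categories, the affix length, or the dictionary matching scheme, so the function names `orthography` and `token_features`, the category labels, the `affix_len=3` default, and the exact-match dictionary lookup are all assumptions made for illustration. POS tags are taken as given input rather than computed.

```python
import re


def orthography(token):
    """Return a coarse orthographic class (hypothetical categories)."""
    if token.isdigit():
        return "ALL_DIGITS"
    if re.search(r"\d", token) and re.search(r"[A-Za-z]", token):
        return "ALPHANUMERIC"
    if token.isupper():
        return "ALL_CAPS"
    if token[:1].isupper():
        return "INIT_CAP"
    if token.islower():
        return "LOWERCASE"
    return "OTHER"


def token_features(tokens, pos_tags, prev_class, i, dictionary, affix_len=3):
    """Build the feature dict for tokens[i].

    Internal resource features: word, POS, orthography, prefix, suffix,
    and the class assigned to the preceding token.
    External resource feature: whether the token matches a dictionary entry
    (assumed here to be a case-insensitive exact match on a single token).
    """
    token = tokens[i]
    return {
        "word": token.lower(),
        "pos": pos_tags[i],
        "orth": orthography(token),
        "prefix": token[:affix_len].lower(),
        "suffix": token[-affix_len:].lower(),
        "prev_class": prev_class,
        "in_dict": token.lower() in dictionary,
    }


# Toy example; the sentence, tags, and dictionary are invented for illustration.
tokens = ["The", "p53", "protein", "binds", "DNA"]
pos_tags = ["DT", "NN", "NN", "VBZ", "NN"]
dictionary = {"p53"}
features = token_features(tokens, pos_tags, "O", 1, dictionary)
```

In a full system, dicts like `features` would be one-hot encoded into a high-dimensional sparse vector and passed to an SVM classifier that labels each token in sequence, with each predicted label supplying `prev_class` for the next token.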