A sentence sliding window approach to extract protein annotations from biomedical articles
Open Access
- 24 May 2005
- journal article
- Published by Springer Nature in BMC Bioinformatics
- Vol. 6 (S1), S19
- https://doi.org/10.1186/1471-2105-6-s1-s19
Abstract
Background: Within the emerging field of text mining and statistical natural language processing (NLP) applied to biomedical articles, a broad variety of techniques has been developed in recent years. Nevertheless, there is still a great need for comparative assessment of the performance of the proposed methods and for the development of common evaluation criteria. This issue was addressed by the Critical Assessment of Text Mining Methods in Molecular Biology (BioCreative) contest. The aim of this contest was to assess the performance of text mining systems applied to biomedical texts, including tools that recognize named entities such as genes and proteins, and tools that automatically extract protein annotations.
Results: The "sentence sliding window" approach proposed here was found to efficiently extract text fragments containing protein annotations from full-text articles, providing the highest number of correctly predicted annotations. Moreover, the number of correct extractions of the individual entities (i.e. proteins and GO terms) involved in the relationships used for the annotations was significantly higher than the number of correct extractions of the complete annotations (protein-function relations).
Conclusion: We explored the use of averaging sentence sliding windows for information extraction, especially in a context where conventional training data is unavailable. Combining our approach with more refined statistical estimators and machine learning techniques might be a way to improve annotation extraction in future biomedical text mining applications.
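The abstract names the technique without giving its details, so the following is only a minimal sketch of what an averaging sentence sliding window could look like, not the authors' implementation. The sentence splitter, the scoring rule (one point for a protein mention plus the fraction of GO term words found) and the names `split_sentences`, `sentence_score` and `best_window` are all assumptions made for illustration.

```python
import re


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentence_score(sentence, protein, go_words):
    # Hypothetical scoring rule: 1.0 if the protein name occurs as a token,
    # plus the fraction of GO term words that occur in the sentence.
    tokens = set(re.findall(r"[a-z0-9]+", sentence.lower()))
    protein_hit = 1.0 if protein.lower() in tokens else 0.0
    overlap = sum(1 for w in go_words if w in tokens) / len(go_words)
    return protein_hit + overlap


def best_window(text, protein, go_term, window_size=2):
    """Return (average score, start index, window text) for the best window,
    or None if the text or the GO term contains nothing to score."""
    sentences = split_sentences(text)
    go_words = re.findall(r"[a-z0-9]+", go_term.lower())
    if not sentences or not go_words:
        return None
    scores = [sentence_score(s, protein, go_words) for s in sentences]
    best = None
    # Slide a window of `window_size` consecutive sentences over the text
    # and average the per-sentence scores inside each window.
    for start in range(max(1, len(sentences) - window_size + 1)):
        window = scores[start:start + window_size]
        avg = sum(window) / len(window)
        if best is None or avg > best[0]:
            best = (avg, start, " ".join(sentences[start:start + window_size]))
    return best


if __name__ == "__main__":
    example = ("The weather was fine. Rad51 binds DNA. "
               "It takes part in DNA repair. Nothing else happened.")
    print(best_window(example, "Rad51", "DNA repair"))
```

Averaging over neighbouring sentences lets a fragment score well even when the protein and the GO term words are mentioned in different sentences, which is one plausible reason to prefer windows over single sentences.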