A probabilistic model for identifying protein names and their name boundaries
- 1 January 2003
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE) in Proceedings. IEEE Computer Society Bioinformatics Conference
- Vol. 2, 251-258
- https://doi.org/10.1109/csb.2003.1227325
Abstract
This paper proposes a method for identifying protein names in biomedical texts with an emphasis on detecting protein name boundaries. We use a probabilistic model which exploits several surface clues characterizing protein names and incorporates word classes for generalization. In contrast to previously proposed methods, our approach does not rely on natural language processing tools such as part-of-speech taggers and syntactic parsers, so as to reduce processing overhead and the potential number of probabilistic parameters to be estimated. A notion of certainty is also proposed to improve precision for identification. We implemented a protein name identification system based on our proposed method, and evaluated the system on real-world biomedical texts in conjunction with the previous work. The results showed that overall our system performs comparably to the state-of-the-art protein name identification system and that higher performance is achieved for compound names. In addition, it is demonstrated that our system can further improve precision by restricting the system output to those names with high certainties.
This publication has 12 references indexed in Scilit:
- An approach to protein name extraction using heuristics and a dictionary. Proceedings of the American Society for Information Science and Technology, 2003
- Accomplishments and challenges in literature data mining for biology. Bioinformatics, 2002
- A multi-level text mining method to extract biological relationships. Published by Institute of Electrical and Electronics Engineers (IEEE), 2002
- A literature based method for identifying gene-disease connections. Published by Institute of Electrical and Electronics Engineers (IEEE), 2002
- Notions of correctness when evaluating protein name taggers. Published by Association for Computational Linguistics (ACL), 2002
- Tuning support vector machines for biomedical named entity recognition. Published by Association for Computational Linguistics (ACL), 2002
- GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics, 2001
- Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 1997
- Message Understanding Conference-6. Published by Association for Computational Linguistics (ACL), 1996
- The zero-frequency problem: estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 1991