Bibliographic attribute extraction from erroneous references based on a statistical model

23 January 2004

proceedings article
Published by Institute of Electrical and Electronics Engineers (IEEE)

p. 49-60
https://doi.org/10.1109/jcdl.2003.1204843

Abstract

In this paper, we propose a method that uses an extended hidden Markov model to extract bibliographic attributes from reference strings captured by Optical Character Recognition (OCR). Bibliographic attribute extraction can be used in two ways. One is reference parsing, in which attribute values are extracted from OCR-processed references for bibliographic matching. The other is reference alignment, in which attribute values are aligned to the bibliographic record to enrich the vocabulary of the bibliographic database. We first propose a statistical model for attribute extraction that represents both the syntactic structure of references and OCR error patterns. We then perform experiments on bibliographic references obtained from scanned images of papers in journals and transactions, and show that useful attribute values can be extracted from OCR-processed references. We also show that the proposed model offers an advantage in reducing the cost of preparing training data, a critical problem in rule-based systems.
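
The general idea of reference parsing with a hidden Markov model can be illustrated with a minimal sketch. This is not the paper's extended model: the state set, all probabilities, the OCR look-alike table, and the names `token_class` and `viterbi` are assumptions made up for illustration, and OCR tolerance is folded into a coarse token classifier rather than modelled as character-level error patterns.

```python
import math

# Hypothetical attribute states; the paper's actual state set is not given here.
STATES = ["author", "title", "year"]

# Assumed toy parameters, hand-picked for illustration (not learned, not from the paper).
START = {"author": 0.8, "title": 0.15, "year": 0.05}
TRANS = {
    "author": {"author": 0.5, "title": 0.45, "year": 0.05},
    "title": {"author": 0.05, "title": 0.75, "year": 0.2},
    "year": {"author": 0.05, "title": 0.05, "year": 0.9},
}
EMIT = {
    "author": {"initial": 0.45, "capitalized": 0.45, "lower": 0.05, "year_like": 0.05},
    "title": {"initial": 0.05, "capitalized": 0.4, "lower": 0.5, "year_like": 0.05},
    "year": {"initial": 0.01, "capitalized": 0.01, "lower": 0.01, "year_like": 0.97},
}

# Letters that OCR may confuse with digits (an illustrative assumption).
OCR_DIGIT_LOOKALIKES = {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}


def token_class(token):
    """Map a token to a coarse observation symbol, tolerating OCR digit confusions."""
    core = token.strip(".,;:()")
    normalized = "".join(OCR_DIGIT_LOOKALIKES.get(ch, ch) for ch in core)
    # Require at least one real digit so ordinary words are not read as years.
    if len(core) == 4 and normalized.isdigit() and any(ch.isdigit() for ch in core):
        return "year_like"
    if len(core) <= 2 and core[:1].isupper():
        return "initial"
    if core[:1].isupper():
        return "capitalized"
    return "lower"


def viterbi(tokens):
    """Return the most probable attribute label for each token (log-space Viterbi)."""
    obs = [token_class(t) for t in tokens]
    # Each column maps state -> (best log-probability, predecessor state).
    best = [{s: (math.log(START[s]) + math.log(EMIT[s][obs[0]]), None) for s in STATES}]
    for o in obs[1:]:
        prev = best[-1]
        column = {}
        for s in STATES:
            score, arg = max((prev[p][0] + math.log(TRANS[p][s]), p) for p in STATES)
            column[s] = (score + math.log(EMIT[s][o]), arg)
        best.append(column)
    # Backtrack from the best final state.
    state = max(STATES, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path


# A made-up reference string in which OCR has turned "2003" into "2OO3".
tokens = "Smith J. Parsing noisy reference strings 2OO3.".split()
print(list(zip(tokens, viterbi(tokens))))
```

With these toy parameters the first two tokens are labelled `author`, the next four `title`, and the damaged year token `year`. A model closer to the one described in the abstract would estimate its parameters from training data and represent OCR errors inside the model itself instead of in a hand-written look-alike table.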
