Robust retrieval of noisy text

23 December 2002

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

p. 76-85
https://doi.org/10.1109/adl.1996.502518

Abstract

We examine the effects of simulated OCR errors on Boolean query models for information retrieval. We show that even relatively small amounts of such noise can have a significant impact. To address this issue, we formulate new variants of the traditional models by combining two classic paradigms for dealing with imprecise data: approximate string matching and fuzzy logic. Using a recall/precision analysis of an experiment involving nearly 60 million query evaluations, we demonstrate that the new fuzzy retrieval methods are generally more robust than their "sharp" counterparts.
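The combination the abstract describes, approximate string matching feeding a fuzzy-logic Boolean query, can be illustrated with a minimal sketch. This is not the paper's actual formulation: the normalized edit-distance similarity, the min/max operators for AND/OR, and all function names (`edit_distance`, `term_membership`, `fuzzy_and`, `fuzzy_or`, `sharp_membership`) are assumptions chosen for illustration. The sample document simulates a common OCR confusion in which "m" is read as "rn".

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the classic dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def sharp_membership(term: str, doc_words: list[str]) -> float:
    """Traditional ("sharp") Boolean test: 1.0 on an exact match, else 0.0."""
    return 1.0 if term in doc_words else 0.0


def term_membership(term: str, doc_words: list[str]) -> float:
    """Fuzzy degree in [0, 1]: best normalized similarity to any document word.

    The normalization (1 - distance / longer length) is an illustrative
    assumption, not taken from the paper.
    """
    best = 0.0
    for word in doc_words:
        longest = max(len(term), len(word))
        if longest == 0:
            continue
        best = max(best, 1.0 - edit_distance(term, word) / longest)
    return best


def fuzzy_and(*degrees: float) -> float:
    """Fuzzy conjunction as the minimum of the membership degrees."""
    return min(degrees)


def fuzzy_or(*degrees: float) -> float:
    """Fuzzy disjunction as the maximum of the membership degrees."""
    return max(degrees)


# Simulated OCR output: "m" misread as "rn".
doc = "rnodern inforrnation retrieval systems".split()

# Query: information AND retrieval
sharp_score = fuzzy_and(sharp_membership("information", doc),
                        sharp_membership("retrieval", doc))
fuzzy_score = fuzzy_and(term_membership("information", doc),
                        term_membership("retrieval", doc))
```

Under these assumptions the sharp query misses the document entirely because "information" never appears verbatim, while the fuzzy query still assigns it a high degree of match, so the document can be ranked rather than lost.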
