Robust retrieval of noisy text
- 23 December 2002
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
Abstract
We examine the effects of simulated OCR errors on Boolean query models for information retrieval. We show that even relatively small amounts of such noise can have a significant impact. To address this issue, we formulate new variants of the traditional models by combining two classic paradigms for dealing with imprecise data: approximate string matching and fuzzy logic. Using a recall/precision analysis of an experiment involving nearly 60 million query evaluations, we demonstrate that the new fuzzy retrieval methods are generally more robust than their "sharp" counterparts.
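The abstract does not give the paper's formulation, so the following is only a minimal sketch of how approximate string matching and fuzzy logic might be combined in a Boolean query model. Everything in it is an assumption made for illustration: the names `edit_distance`, `term_membership`, `sharp_contains`, `fuzzy_and`, `fuzzy_or`, and `fuzzy_not` are hypothetical, the membership score (one minus the length-normalized edit distance) is one plausible choice among many, and the min/max/complement operators are the standard fuzzy-set connectives rather than anything the abstract confirms.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def sharp_contains(term: str, words: list[str]) -> float:
    """'Sharp' Boolean test: 1.0 only if the term occurs exactly."""
    return 1.0 if term in words else 0.0


def term_membership(term: str, words: list[str]) -> float:
    """Assumed fuzzy test: best normalized similarity to any word, in [0, 1]."""
    best = 0.0
    for w in words:
        longest = max(len(term), len(w))
        if longest == 0:
            continue  # skip the degenerate empty-versus-empty comparison
        best = max(best, 1.0 - edit_distance(term, w) / longest)
    return best


# Standard fuzzy-set connectives (assumed here, not taken from the paper).
def fuzzy_and(*degrees: float) -> float:
    return min(degrees)


def fuzzy_or(*degrees: float) -> float:
    return max(degrees)


def fuzzy_not(degree: float) -> float:
    return 1.0 - degree


# 'quick' misread as 'qulck' stands in for a simulated OCR error.
doc = "the qulck brown fox".split()
sharp_score = fuzzy_and(sharp_contains("quick", doc), sharp_contains("fox", doc))
fuzzy_score = fuzzy_and(term_membership("quick", doc), term_membership("fox", doc))
```

Under these assumptions, the query `quick AND fox` scores 0.0 with the sharp test, because the single misread character removes the exact match, while the fuzzy variant still scores about 0.8. A ranking or threshold over such degrees is one way a retrieval model could degrade gradually under noise instead of failing outright.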