Correcting English text using PPM models
- 27 November 2002
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- p. 289-298
- https://doi.org/10.1109/dcc.1998.672157
Abstract
An essential component of many applications in natural language processing is a language modeler able to correct errors in the text being processed. For optical character recognition (OCR), poor scanning quality or extraneous pixels in the image may cause one or more characters to be misrecognized, while for spelling correction, two characters may be transposed, or a character may be inadvertently inserted or omitted. This paper describes a method for correcting English text using a PPM model. A method that segments words in English text is introduced and is shown to be a significant improvement over previously used methods. A similar technique is also applied as a post-processing stage after pages have been recognized by a state-of-the-art commercial OCR system. We show that the accuracy of the OCR system can be increased from 96.3% to 96.9%, a decrease of about 14 errors per page.
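The word-segmentation idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a simplified fixed-order character model with PPM-style escape probabilities (no exclusions, no adaptive updating), and it picks the placement of spaces that minimizes the total code length under that model using a Viterbi search. The names `CharModel`, `segment`, and `total_cost`, and the toy training text, are hypothetical and introduced only for illustration.

```python
import math
from collections import defaultdict


class CharModel:
    """Simplified PPM-style character model (assumption: escape counts equal
    the number of distinct symbols seen in a context; no exclusions)."""

    def __init__(self, order=2):
        self.order = order
        # counts[context][char] = times char followed context in training
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, text):
        for i, ch in enumerate(text):
            for k in range(self.order + 1):
                if i - k >= 0:
                    self.counts[text[i - k:i]][ch] += 1

    def cost(self, context, ch):
        """Bits to code ch after context, backing off through escapes."""
        bits = 0.0
        context = context[-self.order:] if self.order else ""
        for k in range(len(context), -1, -1):
            seen = self.counts.get(context[len(context) - k:])
            if not seen:
                continue  # unseen context: fall to a shorter one at no cost
            total = sum(seen.values())
            distinct = len(seen)
            if ch in seen:
                return bits - math.log2(seen[ch] / (total + distinct))
            bits -= math.log2(distinct / (total + distinct))  # escape
        return bits + 8.0  # final fallback: uniform over 256 symbols


def total_cost(model, text):
    """Total bits the model needs to code text from an empty context."""
    return sum(model.cost(text[:i], ch) for i, ch in enumerate(text))


def segment(model, text):
    """Insert spaces into text so that the total model cost is minimal."""
    # Viterbi state: the last `order` output characters -> (cost, output)
    beams = {"": (0.0, "")}
    for ch in text:
        nxt = {}
        for cost, out in beams.values():
            # Either emit the character directly or put one space before it.
            pieces = (ch, " " + ch) if out else (ch,)
            for piece in pieces:
                c, o = cost, out
                for p in piece:
                    c += model.cost(o, p)
                    o += p
                key = o[-model.order:] if model.order else ""
                if key not in nxt or c < nxt[key][0]:
                    nxt[key] = (c, o)
        beams = nxt
    return min(beams.values())[1]


# Toy usage: train on a tiny repeated sentence, then segment unspaced text.
model = CharModel(order=2)
model.train("the cat sat on the mat " * 20)
print(segment(model, "thecatsatonthemat"))
```

Because the search state holds exactly the context the model conditions on, the search is exact for this static model: the returned string never costs more bits than the unsegmented input. A realistic system would train on a large corpus and use a higher-order model.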