Language identification for printed text independent of segmentation

19 November 2002

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

Vol. 3, pp. 428-431
https://doi.org/10.1109/icip.1995.537663

Abstract

This paper presents efficient algorithms for determining the language classification of machine-generated documents without requiring the identification of individual characters. Such algorithms may be useful for sorting facsimile documents as they arrive, so that appropriate routing and secondary analysis, which may include OCR, are selected for each document. They may also prove useful as a component of a content-addressable document access system. Numerous reported efforts attempt to segment printed documents into homogeneous regions using Hough transforms, hidden Markov models, morphological filtering, and neural networks. However, language identification can be accomplished without explicit segmentation using the less computationally intensive methods described.
