Determination of the script and language content of document images
- 1 March 1997
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- Vol. 19 (3), 235-245
- https://doi.org/10.1109/34.584100
Abstract
Most document recognition work to date has been performed on English text. Because of the large overlap of the character sets found in English and major Western European languages such as French and German, some extensions of the basic English capability to those languages have taken place. However, automatic language identification prior to optical character recognition is not commonly available and adds utility to such systems. Languages and their scripts have attributes that make it possible to determine the language of a document automatically. Detection of the values of these attributes requires the recognition of particular features of the document image and, in the case of languages using Latin-based symbols, the character syntax of the underlying language. We have developed techniques for distinguishing which language is represented in an image of text. This work is restricted to a small but important subset of the world's languages. The method first classifies the script into two broad classes: Han-based and Latin-based. This classification is based on the spatial relationships of features related to the upward concavities in character structures. Language identification within the Han script class (Chinese, Japanese, Korean) is performed by analysis of the distribution of optical density in the text images. We handle 23 Latin-based languages using a technique based on character shape codes, a representation of Latin text that is inexpensive to compute.
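The idea behind character shape codes can be illustrated with a minimal sketch. It is not the paper's implementation: the shape classes (`A`, `x`, `g`, `i`, `j`), the function names, and the tiny per-language word lists are all assumptions chosen for illustration, and the sketch works on plain text rather than on features measured in a document image. It maps each character to a coarse class according to whether it has an ascender, a descender, or neither, and then scores a text against the shape tokens of a few frequent words per language.

```python
from collections import Counter

# Hypothetical shape classes; the code set used in the paper may differ.
ASCENDERS = set("bdfhklt") | set("ABCDEFGHIJKLMNOPQRSTUVWXYZ") | set("0123456789")
DESCENDERS = set("gpqy")
X_HEIGHT = set("acemnorsuvwxz")


def shape_code(ch):
    """Map one character to a coarse shape class."""
    if ch in ASCENDERS:
        return "A"  # rises above the x-height
    if ch in DESCENDERS:
        return "g"  # drops below the baseline
    if ch in ("i", "j"):
        return ch   # dotted characters keep their own class
    if ch in X_HEIGHT:
        return "x"  # fits between baseline and x-height
    return ch       # punctuation and other symbols are left unchanged


def word_shape_token(word):
    """Encode a word as a string of shape classes."""
    return "".join(shape_code(c) for c in word)


# Illustrative frequent-word lists (assumed, not taken from the paper).
FREQUENT_WORDS = {
    "english": ["the", "of", "and", "to", "in"],
    "german": ["der", "die", "und", "in", "den"],
}
PROFILES = {
    lang: {word_shape_token(w) for w in words}
    for lang, words in FREQUENT_WORDS.items()
}


def guess_language(text, profiles):
    """Return the language whose frequent shape tokens occur most often."""
    tokens = Counter(word_shape_token(w) for w in text.split())
    scores = {
        lang: sum(tokens[t] for t in shape_tokens)
        for lang, shape_tokens in profiles.items()
    }
    return max(scores, key=scores.get)


print(word_shape_token("the"))                            # AAx
print(guess_language("the cat and the dog", PROFILES))    # english
print(guess_language("der hund und die katze", PROFILES)) # german
```

Because many different words collapse to the same token (for example, "and" and "und" both become `xxA`), a scheme like this relies on the overall distribution of tokens rather than on any single word.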