Automatic script identification from images using cluster-based templates
- 19 November 2002
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- Vol. 1, 378-381
- https://doi.org/10.1109/icdar.1995.599017
Abstract
We describe a system that automatically identifies the script used in documents stored electronically in image form. The system can learn to distinguish any number of scripts. It develops a set of representative symbols (templates) for each script by clustering textual symbols from a set of training documents and representing each cluster by its centroid. "Textual symbols" include discrete characters in scripts such as Cyrillic, as well as adjoined characters, character fragments, and whole words in connected scripts such as Arabic. To identify a new document's script, the system compares a subset of symbols from the document to each script's templates, screening out rare or unreliable templates, and choosing the script whose templates provide the best match. Our current system, trained on thirteen scripts, correctly identifies all test documents except those printed in fonts that differ markedly from fonts in the training set.
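The pipeline described in the abstract (cluster training symbols, keep each cluster's centroid as a template, screen out rare templates, pick the script whose templates match best) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: symbols are treated as already-extracted, fixed-length feature vectors, clustering is a simple greedy pass with a hypothetical `radius` threshold, "rare" means fewer than a hypothetical `min_count` members, and match quality is the mean Euclidean distance to the nearest template. The function names `build_templates` and `identify_script` are likewise hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def build_templates(symbols, radius=1.0, min_count=2):
    """Greedily cluster symbol vectors and return the centroids of
    clusters with at least `min_count` members (rare ones are dropped).
    Assumed scheme for illustration, not the paper's clustering method."""
    clusters = []  # each entry: [component-wise sum, member count]
    for s in symbols:
        best, best_d = None, radius
        for c in clusters:
            centroid = [v / c[1] for v in c[0]]
            d = dist(centroid, s)
            if d <= best_d:
                best, best_d = c, d
        if best is None:
            clusters.append([list(s), 1])  # start a new cluster
        else:
            best[0] = [a + b for a, b in zip(best[0], s)]
            best[1] += 1
    return [[v / n for v in total] for total, n in clusters if n >= min_count]


def identify_script(doc_symbols, templates_by_script):
    """Return the script whose templates give the smallest mean
    nearest-template distance over the document's symbols."""
    def score(templates):
        total = sum(min(dist(t, s) for t in templates) for s in doc_symbols)
        return total / len(doc_symbols)

    # Scripts left with no templates after screening cannot be scored.
    candidates = {k: v for k, v in templates_by_script.items() if v}
    return min(candidates, key=lambda name: score(candidates[name]))


# Toy 2-D "symbols"; the (9.0, 9.0) outlier forms a singleton cluster
# and is screened out as a rare template.
training = {
    "A": [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (9.0, 9.0)],
    "B": [(5.0, 5.0), (5.2, 4.9), (4.8, 5.1)],
}
templates = {name: build_templates(syms) for name, syms in training.items()}
guess = identify_script([(0.1, 0.2), (0.3, 0.0)], templates)
```

In this toy setup `guess` is the label of the script whose surviving centroid lies nearest the document's symbols. A real system would need image segmentation, size normalization, and a more careful reliability test for templates, none of which are modeled here.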