Some Characteristic Curves for Dictionary Organization with Digital Search
- 1 May 1987
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Systems, Man, and Cybernetics
- Vol. 17 (3), 520-527
- https://doi.org/10.1109/TSMC.1987.4309070
Abstract
Some parameters that characterize the behavior of natural language text under a typical dictionary organization are defined. The dictionary structure considered here is based on a trie employing digital search. Digital search is well suited to applications such as optical character recognition. The dictionary is organized in three partitions. The first partition holds the most frequently used words, represented completely in a trie in main memory. In the second partition only part of each word is stored in the trie, and the rest is stored in suitable tail structures, also in main memory. In the third partition a still smaller part of each word is held in the trie, and the remainders of the words are held in files on a secondary storage device. The parameters defined are the trie nonutility factor, a measure of the effectiveness of the trie structure; the streaming factor, a measure of the common part that exists at the beginning of words; the node-utilization factor, a measure of how well a multilink node structure is suited for use as a trie node; and the dispersion factor, a measure of the average number of elements in the tail structures. The plots of these parameters act as characteristic curves, much like device characteristic curves, which help the designer evaluate a design choice in terms of memory requirements. These characteristic curves have been obtained using the most commonly cited collection of English text, the Brown Corpus.
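The second-partition idea in the abstract (a word prefix held in the trie, the remainder held in a tail structure) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the fixed prefix `depth`, the 26-way node assumption, and the two formulas are all assumptions made here, since the abstract describes the dispersion and node-utilization factors only informally.

```python
from collections import defaultdict


def build_partial_trie(words, depth):
    """Store the first `depth` letters of each word in a nested-dict trie
    and keep the remaining letters in a per-prefix tail list.
    (Hypothetical helper; the fixed `depth` is an assumption.)"""
    root = {}
    tails = defaultdict(list)  # prefix -> list of word remainders
    for word in sorted(set(words)):
        prefix, rest = word[:depth], word[depth:]
        node = root
        for ch in prefix:
            node = node.setdefault(ch, {})
        tails[prefix].append(rest)
    return root, tails


def average_tail_size(tails):
    """Mean number of elements per tail structure.
    Assumed stand-in for the dispersion factor described in the abstract."""
    if not tails:
        return 0.0
    return sum(len(t) for t in tails.values()) / len(tails)


def link_utilization(root, alphabet_size=26):
    """Fraction of child links actually used across internal trie nodes,
    assuming every node reserves `alphabet_size` links.
    Assumed stand-in for the node-utilization factor."""
    used = nodes = 0
    stack = [root]
    while stack:
        node = stack.pop()
        if node:  # count only nodes that have at least one child
            nodes += 1
            used += len(node)
            stack.extend(node.values())
    return used / (nodes * alphabet_size) if nodes else 0.0


words = ["the", "then", "there", "this", "that", "cat", "car"]
root, tails = build_partial_trie(words, depth=2)
print(dict(tails))
print(average_tail_size(tails), link_utilization(root))
```

Plotting quantities like these against the prefix depth, for a word list drawn from a corpus, would give curves of the general kind the abstract describes; a shallower trie uses less node memory but leaves longer tail lists to search.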