The identification of variable-length, equifrequent character strings in a natural language data base
Open Access
- 1 March 1972
- journal article
- Published by Oxford University Press (OUP) in The Computer Journal
- Vol. 15 (3) , 259-262
- https://doi.org/10.1093/comjnl/15.3.259
Abstract
The words of natural language texts exhibit a hyperbolic (or Zipfian) rank-frequency relationship, i.e., a small number of common words accounts for a large proportion of word occurrences, while a large number of the words occur as singletons or only infrequently. Inverted-file retrieval systems using free text data bases commonly identify words as the keys or index terms about which the file is inverted, and through which access is provided. They therefore involve large and growing dictionaries and may entail inefficient utilisation of storage because of the distribution characteristics. An alternative approach may be based on the analysis of text in terms of sets of variable-length character strings, the frequency distributions of which are much less disparate than those of words. This could lead to substantial reductions in dictionary size, and increased efficiency both in dictionary look-up times and storage utilisation.
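The skew described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure: the helper names (`rank_frequency`, `skew_summary`, `char_ngrams`) and the toy sentence are assumptions made for illustration, and fixed-length character n-grams are used only as a crude stand-in for the variable-length strings the abstract refers to, whose selection method is not given here.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (token, count) pairs ordered from most to least frequent."""
    return Counter(tokens).most_common()


def skew_summary(tokens, top_k=2):
    """Summarise how uneven a frequency distribution is.

    Returns the share of all occurrences covered by the top_k most
    frequent types, and the fraction of types that occur exactly once.
    """
    ranked = rank_frequency(tokens)
    total = sum(count for _, count in ranked)
    top_share = sum(count for _, count in ranked[:top_k]) / total
    singleton_fraction = sum(1 for _, count in ranked if count == 1) / len(ranked)
    return top_share, singleton_fraction


def char_ngrams(text, n):
    """Hypothetical stand-in: all overlapping character strings of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Toy text (an assumption; any real corpus would be far larger).
sample = "the cat sat on the mat and the dog sat on the log"
words = sample.split()

word_top, word_single = skew_summary(words)
print(f"words: top-2 share {word_top:.2f}, singleton types {word_single:.2f}")

gram_top, gram_single = skew_summary(char_ngrams(sample, 3))
print(f"3-grams: top-2 share {gram_top:.2f}, singleton types {gram_single:.2f}")
```

Running the same summary over words and over candidate character strings from a real corpus is one way to compare how disparate the two distributions are; a sentence-sized sample like this one is too small to support any conclusion.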