The identification of variable-length, equifrequent character strings in a natural language data base

Open Access

1 March 1972

journal article
Published by Oxford University Press (OUP) in The Computer Journal

Vol. 15 (3), 259-262
https://doi.org/10.1093/comjnl/15.3.259

Abstract

The words of natural language texts exhibit a hyperbolic (or Zipfian) rank-frequency relationship, i.e., a small number of common words accounts for a large proportion of word occurrences, while a large number of the words occur as singletons or only infrequently. Inverted-file retrieval systems using free text data bases commonly identify words as the keys or index terms about which the file is inverted, and through which access is provided. They therefore involve large and growing dictionaries and may entail inefficient utilisation of storage because of the distribution characteristics. An alternative approach may be based on the analysis of text in terms of sets of variable-length character strings, the frequency distributions of which are much less disparate than those of words. This could lead to substantial reductions in dictionary size, and increased efficiency both in dictionary look-up times and storage utilisation.
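The idea of replacing words with variable-length character strings of roughly equal frequency can be illustrated with a minimal Python sketch. This is not the authors' procedure, which the abstract does not describe; it is one simple greedy scheme chosen for illustration. The function names `grow_strings` and `frequency_spread`, and the parameters `threshold` and `max_len`, are all hypothetical. The sketch starts from single characters and lengthens any string that occurs more often than `threshold`, so that very common strings are split into several rarer, longer ones.

```python
from collections import Counter


def grow_strings(text, threshold, max_len):
    """Greedy illustration (an assumption, not the paper's algorithm).

    Start with single characters. Any string seen more than `threshold`
    times is replaced by its one-character right extensions, until
    `max_len` is reached. Returns a dict mapping each selected string
    to its occurrence count. An occurrence at the very end of the text
    that cannot be extended is dropped by this simple sketch.
    """
    keys = {}
    length = 1
    frontier = Counter(text)  # counts of single characters
    while frontier:
        to_extend = set()
        for s, count in frontier.items():
            if count > threshold and length < max_len:
                to_extend.add(s)   # too frequent: split further
            else:
                keys[s] = count    # accepted as a dictionary key
        if not to_extend:
            break
        length += 1
        # Count only the substrings whose prefix was marked for extension.
        frontier = Counter(
            text[i:i + length]
            for i in range(len(text) - length + 1)
            if text[i:i + length - 1] in to_extend
        )
    return keys


def frequency_spread(counts):
    """Ratio of the highest to the lowest count; 1.0 means equifrequent."""
    values = list(counts.values())
    return max(values) / min(values)


if __name__ == "__main__":
    sample = "the cat sat on the mat and the rat sat on the hat"
    words = Counter(sample.split())
    strings = grow_strings(sample, threshold=4, max_len=3)
    print("word spread:  ", frequency_spread(words))
    print("string spread:", frequency_spread(strings))
```

A lower `threshold` yields more and longer strings; whether the resulting spread is narrower than that of words depends on the text and on the parameters, so the sketch should be read as a way to experiment with the claim rather than as evidence for it.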

Keywords

NATURAL LANGUAGE
