Compression of continuous prose texts using variety generation

1 May 1980

journal article
Published by Wiley in Journal of the American Society for Information Science

Vol. 31 (3), 201-207
https://doi.org/10.1002/asi.4630310312

Abstract

The use of variety‐generation techniques for text compression depends on the selection of symbol sets, or sets of variable‐length character strings occurring approximately equifrequently in the text in question. In order that the method perform efficiently in a variety of situations, the symbol set must be reasonably independent of the particular text used in its generation. Hence, texts of different origins must be similar in their microstructure for the technique to work well. Texts of American English varying in subject and style have been found to fulfill this condition. On average the texts can be represented with a space saving of just over 50% on the space used by a fixed‐length 8‐bit representation of the characters, and the best results are obtained using a symbol set generated from a sample of the complete data base, although results from subsets of the data base are almost as good.
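
The space-saving arithmetic behind the abstract can be illustrated with a minimal sketch. It is not the authors' procedure. It assumes a symbol set is already available (the toy `SYMBOLS` below is hypothetical and hand-picked, not produced by variety generation), that text is parsed by greedy longest match, and that every symbol receives a fixed-length code of ceil(log2(N)) bits for a set of N symbols. The function names `greedy_parse` and `space_saving` are likewise illustrative.

```python
import math

def greedy_parse(text, symbols):
    """Split text into symbols, taking the longest match at each position."""
    max_len = max(len(s) for s in symbols)
    parsed = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in symbols:
                parsed.append(piece)
                i += n
                break
        else:
            raise ValueError(f"no symbol covers {text[i]!r}")
    return parsed

def space_saving(text, symbols):
    """Fraction of space saved relative to 8 bits per character."""
    parsed = greedy_parse(text, symbols)
    bits_per_symbol = math.ceil(math.log2(len(symbols)))  # fixed-length code
    compressed_bits = len(parsed) * bits_per_symbol
    original_bits = 8 * len(text)
    return 1 - compressed_bits / original_bits

# Hypothetical toy symbol set: single characters guarantee that any
# lowercase text can be parsed; the longer strings are illustrative only.
SYMBOLS = set("abcdefghijklmnopqrstuvwxyz ") | {"the ", "ing ", "tion", " of ", "and "}

sample = "the compression of the text"
parsed = greedy_parse(sample, SYMBOLS)
saving = space_saving(sample, SYMBOLS)
print(parsed)
print(f"{saving:.1%}")
```

The figure printed for this toy input says nothing about the result reported in the abstract, which depends on symbol sets generated from real text samples. The sketch only shows why longer symbols that occur about equally often pair well with fixed-length codes: each code costs the same number of bits however many characters it replaces.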
