Compression of continuous prose texts using variety generation

Abstract
The use of variety‐generation techniques for text compression depends on the selection of symbol sets, that is, sets of variable‐length character strings that occur approximately equifrequently in the text in question. For the method to perform efficiently in a variety of situations, the symbol set must be reasonably independent of the particular text used to generate it. Hence, texts of different origins must be similar in their microstructure for the technique to work well. Texts of American English varying in subject and style have been found to fulfill this condition. On average, the texts can be represented with a space saving of just over 50% relative to a fixed‐length 8‐bit representation of the characters. The best results are obtained using a symbol set generated from a sample of the complete data base, although results from subsets of the data base are almost as good.
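
The general idea can be illustrated with a minimal sketch. It is not the authors' procedure: the abstract does not say how the symbol set is constructed, so the sketch simply takes every single character plus the most frequent longer strings in a sample, which only roughly approximates the equifrequency criterion. The function names, the symbol-set size of 256, the maximum string length of 4, and the greedy longest-match parse are all assumptions made for illustration.

```python
from collections import Counter
from math import ceil, log2


def build_symbol_set(sample, size=256, max_len=4):
    """Hypothetical heuristic: all single characters plus the most
    frequent strings of length 2..max_len, up to `size` symbols."""
    symbols = set(sample)  # keep every character encodable
    counts = Counter(
        sample[i:i + n]
        for n in range(2, max_len + 1)
        for i in range(len(sample) - n + 1)
    )
    for gram, _ in counts.most_common():
        if len(symbols) >= size:
            break
        symbols.add(gram)
    return symbols


def encode(text, symbols, max_len=4):
    """Greedy longest-match parse of `text` into symbols (an assumption;
    other parsing strategies are possible)."""
    out, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in symbols:
                out.append(piece)
                i += n
                break
        else:
            raise ValueError(f"character {text[i]!r} not in symbol set")
    return out


def space_saving(text, symbols, max_len=4):
    """Fraction of space saved relative to 8 bits per character,
    assuming one fixed-length code per symbol."""
    bits_per_symbol = max(1, ceil(log2(len(symbols))))
    coded_bits = len(encode(text, symbols, max_len)) * bits_per_symbol
    return 1 - coded_bits / (8 * len(text))
```

To mimic the experiment described above, one would build the symbol set from one sample and call `space_saving` on a different text. Any character of the new text that is absent from the sample raises an error in this sketch, which is one concrete way in which a symbol set can depend on the text used to generate it.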
