Compression of continuous prose texts using variety generation
- 1 May 1980
- journal article
- Published by Wiley in Journal of the American Society for Information Science
- Vol. 31 (3) , 201-207
- https://doi.org/10.1002/asi.4630310312
Abstract
The use of variety‐generation techniques for text compression depends on the selection of symbol sets, or sets of variable‐length character strings occurring approximately equifrequently in the text in question. In order that the method perform efficiently in a variety of situations, the symbol set must be reasonably independent of the particular text used in its generation. Hence, texts of different origins must be similar in their microstructure for the technique to work well. Texts of American English varying in subject and style have been found to fulfill this condition. On average the texts can be represented with a space saving of just over 50% on the space used by a fixed‐length 8‐bit representation of the characters, and the best results are obtained using a symbol set generated from a sample of the complete data base, although results from subsets of the data base are almost as good.
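As a rough illustration of the idea in the abstract, the sketch below parses a text into variable-length strings drawn from a symbol set, assigns every symbol a code of the same fixed length, and measures the space saving against 8 bits per character. It is not the authors' procedure: the greedy longest-match parse, the hand-picked toy symbol set, and the names `encode`, `decode`, and `space_saving` are all assumptions made for this example, and the toy figures say nothing about the results reported in the article.

```python
import math


def encode(text, symbols):
    """Greedy longest-match parse of `text` into indices into `symbols`.

    Assumption: the symbol set contains every single character of the
    text, so a parse always exists.
    """
    index = {s: i for i, s in enumerate(symbols)}
    max_len = max(len(s) for s in symbols)
    codes = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate string first, then shorter ones.
        for length in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            if piece in index:
                codes.append(index[piece])
                pos += length
                break
        else:
            raise ValueError(f"no symbol covers {text[pos]!r}")
    return codes


def decode(codes, symbols):
    """Rebuild the text by concatenating the strings the codes point to."""
    return "".join(symbols[c] for c in codes)


def space_saving(text, codes, symbols):
    """Fraction of space saved relative to 8 bits per character.

    Assumption: every symbol gets a code of the same length, which is
    close to optimal only if the symbols are roughly equifrequent.
    """
    bits_per_code = max(1, math.ceil(math.log2(len(symbols))))
    compressed_bits = len(codes) * bits_per_code
    original_bits = 8 * len(text)
    return 1 - compressed_bits / original_bits


# Hypothetical toy symbol set: single characters plus a few longer strings.
symbols = ["t", "h", "e", " ", "c", "a", "s", "o", "n", "m",
           "the ", "at ", "on "]
text = "the cat sat on the mat"
codes = encode(text, symbols)
saving = space_saving(text, codes, symbols)
```

A symbol set built from one sample and applied to unrelated text is the situation the abstract is concerned with: the saving holds up only to the extent that the two texts share the same frequent strings.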
This publication has 13 references indexed in Scilit:
- The stability of symbol sets produced by variety generation from bibliographic data. Program: electronic library and information systems, 1978
- Text compression with an Associative Parallel Processor. The Computer Journal, 1978
- Variety generation—A reinterpretation of Shannon's mathematical theory of communication, and its implications for information science. Journal of the American Society for Information Science, 1977
- A comparison of algorithms for data base compression by use of fragments as language elements. Information Storage and Retrieval, 1974
- Zipf's law and entropy (Corresp.). IEEE Transactions on Information Theory, 1974
- Selection of equifrequent word fragments for information retrieval. Information Storage and Retrieval, 1973
- The identification of variable-length, equifrequent character strings in a natural language data base. The Computer Journal, 1972
- A note on the entropy of words in printed English. Information and Control, 1964
- Prediction and Entropy of Printed English. Bell System Technical Journal, 1951
- A Mathematical Theory of Communication. Bell System Technical Journal, 1948