The zero-frequency problem: estimating the probabilities of novel events in adaptive text compression
- 1 July 1991
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Information Theory
- Vol. 37 (4) , 1085-1094
- https://doi.org/10.1109/18.87000
Abstract
Approaches to the zero-frequency problem in adaptive text compression are discussed. This problem relates to the estimation of the likelihood of a novel event occurring. Although several methods have been used, their suitability has been on empirical evaluation rather than a well-founded model. The authors propose the application of a Poisson process model of novelty. Its ability to predict novel tokens is evaluated, and it consistently outperforms existing methods. It is applied to a practical statistical coding scheme, where a slight modification is required to avoid divergence. The result is a well-founded zero-frequency model that explains observed differences in the performance of existing methods, and offers a small improvement in the coding efficiency of text compression over the best method previously knownKeywords
This publication has 15 references indexed in Scilit:
- Arithmetic coding for data compressionCommunications of the ACM, 1987
- A locally adaptive data compression schemeCommunications of the ACM, 1986
- Data Compression Using Adaptive Coding and Partial String MatchingIEEE Transactions on Communications, 1984
- A general minimum-redundancy source-coding algorithmIEEE Transactions on Information Theory, 1980
- Arithmetic stream coding using fixed precision registersIEEE Transactions on Information Theory, 1979
- Arithmetic CodingIBM Journal of Research and Development, 1979
- A universal algorithm for sequential data compressionIEEE Transactions on Information Theory, 1977
- Estimating the Number of Unsen Species: How Many Words Did Shakespeare Know?Biometrika, 1976
- Generalized Kraft Inequality and Arithmetic CodingIBM Journal of Research and Development, 1976
- THE NUMBER OF NEW SPECIES, AND THE INCREASE IN POPULATION COVERAGE, WHEN A SAMPLE IS INCREASEDBiometrika, 1956