Data compression using long common strings
- 1 January 1999
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- p. 287-295
- https://doi.org/10.1109/dcc.1999.755678
Abstract
We describe a precompression algorithm that effectively represents any long common strings that appear in a file. The algorithm interacts well with standard compression algorithms that represent shorter strings that are near in the input text. Our experiments show that some real data sets do indeed contain many long common strings. We extend the fingerprint mechanisms of our algorithm to a program that identifies long common strings in an input file. This program gives interesting insights into the structure of real data files that contain long common strings.Keywords
This publication has 11 references indexed in Scilit:
- A corpus for the evaluation of lossless compression algorithmsPublished by Institute of Electrical and Electronics Engineers (IEEE) ,2002
- On finding duplication and near-duplication in large software systemsPublished by Institute of Electrical and Electronics Engineers (IEEE) ,2002
- Unbounded Length Contexts for PPMThe Computer Journal, 1997
- Suffix Arrays: A New Method for On-Line String SearchesSIAM Journal on Computing, 1993
- Efficient randomized pattern-matching algorithmsIBM Journal of Research and Development, 1987
- Data compression via textual substitutionJournal of the ACM, 1982
- Compression of individual sequences via variable-rate codingIEEE Transactions on Information Theory, 1978
- A universal algorithm for sequential data compressionIEEE Transactions on Information Theory, 1977
- A Space-Economical Suffix Tree Construction AlgorithmJournal of the ACM, 1976
- Printed english compression by dictionary encodingProceedings of the IEEE, 1967