Data compression using long common strings

1 January 1999

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

p. 287-295
https://doi.org/10.1109/dcc.1999.755678

Abstract

We describe a precompression algorithm that effectively represents any long common strings that appear in a file. The algorithm interacts well with standard compression algorithms that represent shorter strings that are near in the input text. Our experiments show that some real data sets do indeed contain many long common strings. We extend the fingerprint mechanisms of our algorithm to a program that identifies long common strings in an input file. This program gives interesting insights into the structure of real data files that contain long common strings.

Keywords

This publication has 11 references indexed in Scilit:

A corpus for the evaluation of lossless compression algorithms
Published by Institute of Electrical and Electronics Engineers (IEEE) ,2002
On finding duplication and near-duplication in large software systems
Published by Institute of Electrical and Electronics Engineers (IEEE) ,2002
Unbounded Length Contexts for PPM
The Computer Journal, 1997
Suffix Arrays: A New Method for On-Line String Searches
SIAM Journal on Computing, 1993
Efficient randomized pattern-matching algorithms
IBM Journal of Research and Development, 1987
Data compression via textual substitution
Journal of the ACM, 1982
Compression of individual sequences via variable-rate coding
IEEE Transactions on Information Theory, 1978
A universal algorithm for sequential data compression
IEEE Transactions on Information Theory, 1977
A Space-Economical Suffix Tree Construction Algorithm
Journal of the ACM, 1976
Printed english compression by dictionary encoding
Proceedings of the IEEE, 1967