When indexing equals compression

1 October 2006

journal article
research article
Published by Association for Computing Machinery (ACM) in ACM Transactions on Algorithms

Vol. 2 (4) , 611-639
https://doi.org/10.1145/1198513.1198521

Abstract

We report on a new experimental analysis of high-order entropy-compressed suffix arrays, which retains the theoretical performance of previous work and represents an improvement in practice. Our experiments indicate that the resulting text index offers state-of-the-art compression. In particular, we require roughly 20% of the original text size---without requiring a separate instance of the text. We can additionally use a simple notion to encode and decode block-sorting transforms (such as the Burrows--Wheeler transform), achieving a compression ratio comparable to that of bzip2. We also provide a compressed representation of suffix trees (and their associated text) in a total space that is comparable to that of the text alone compressed with gzip.

Keywords

This publication has 24 references indexed in Scilit:

Indexing compressed text
Journal of the ACM, 2005
Boosting textual compression in optimal linear time
Journal of the ACM, 2005
The Level Ancestor Problem simplified
Theoretical Computer Science, 2004
New text indexing functionalities of the compressed suffix arrays
Journal of Algorithms, 2003
Burrows–Wheeler compression with variable length integer codes
Software: Practice and Experience, 2002
Space Efficient Suffix Trees
Journal of Algorithms, 2001
Arithmetic coding revisited
ACM Transactions on Information Systems, 1998
Fractional cascading: I. A data structuring technique
Algorithmica, 1986
A locally adaptive data compression scheme
Communications of the ACM, 1986
A Space-Economical Suffix Tree Construction Algorithm
Journal of the ACM, 1976