Direct pattern matching on compressed text
- 27 November 2002
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
Abstract
We present a fast compression and decompression technique for natural language texts. The novelty is that the exact search can be done on the compressed text directly, using any known sequential pattern matching algorithm. Approximate search can also be done efficiently without any decoding. The compression scheme uses a semi static word based modeling and a Huffman coding where the coding alphabet is byte oriented rather than bit oriented. We use the first bit of each byte to mark the beginning of a word, which allows the searching of the compressed pattern directly on the compressed text. We achieve about 33% compression ratio for typical English texts. When searching for simple patterns, our experiments show that running our algorithm on a compressed text is almost twice as fast as running agrep on the uncompressed version of the same text. When searching complex or approximate patterns, our algorithm is up to 8 times faster than agrep.Keywords
This publication has 13 references indexed in Scilit:
- Efficient two-dimensional compressed matchingPublished by Institute of Electrical and Electronics Engineers (IEEE) ,2003
- A text compression scheme that allows fast searching directly in the compressed fileACM Transactions on Information Systems, 1997
- Multiple approximate string matchingPublished by Springer Nature ,1997
- Let Sleeping Files Lie: Pattern Matching in Z-Compressed FilesJournal of Computer and System Sciences, 1996
- A faster algorithm for approximate string matchingPublished by Springer Nature ,1996
- Fast text searchingCommunications of the ACM, 1992
- A very fast substring search algorithmCommunications of the ACM, 1990
- Word‐based text compressionSoftware: Practice and Experience, 1989
- A locally adaptive data compression schemeCommunications of the ACM, 1986
- A Method for the Construction of Minimum-Redundancy CodesProceedings of the IRE, 1952