Protein is incompressible

1 January 1999

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

No. 10680314, pp. 257-266
https://doi.org/10.1109/dcc.1999.755675

Abstract

Life is based on two polymers, DNA and protein, whose properties can be described in a simple text file. It is natural to expect that standard text compression techniques would work on biological sequences as they do on English text. But biological sequences have a fundamentally different structure from linguistic ones, and standard compression schemes exhibit disappointing performance on them. We describe a new approach to compression that takes account of the underlying biochemical principles. This gives rise to a generalization of blending for statistical compressors where every context is used, weighted by its similarity to the current context. Results support what research in bioinformatics has shown, that there is little Markov dependency in protein. This cripples data compression schemes and reduces them to order zero models.
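
The claim that protein shows little Markov dependency can be illustrated by comparing an order-zero entropy estimate with an entropy estimate conditioned on the preceding symbols. The sketch below is not the method described in the paper (which blends all contexts weighted by similarity); it is a plain plug-in estimator, and the function names and the toy sequence are assumptions made for illustration. For a 20-letter amino acid alphabet the order-zero figure cannot exceed log2(20), about 4.32 bits per symbol.

```python
import math
from collections import Counter


def order0_entropy(seq: str) -> float:
    """Empirical order-zero entropy of seq, in bits per symbol."""
    counts = Counter(seq)
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def conditional_entropy(seq: str, k: int) -> float:
    """Empirical entropy of a symbol given the k preceding symbols, in bits per symbol."""
    ctx_counts = Counter()   # how often each length-k context occurs
    pair_counts = Counter()  # how often each (context, next symbol) pair occurs
    for i in range(k, len(seq)):
        ctx = seq[i - k:i]
        ctx_counts[ctx] += 1
        pair_counts[(ctx, seq[i])] += 1
    n = len(seq) - k
    return -sum(
        c / n * math.log2(c / ctx_counts[ctx])
        for (ctx, _), c in pair_counts.items()
    )


if __name__ == "__main__":
    # Hypothetical toy sequence in one-letter amino acid codes, for illustration only.
    toy = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVV"
    print("order 0:", order0_entropy(toy))
    print("order 1:", conditional_entropy(toy, 1))
```

On short inputs the conditional estimate is biased downward, because most contexts are seen only a few times, so a meaningful comparison needs a large corpus. If the two figures stay close on such a corpus, conditioning on context gains little over an order-zero model, which is the situation the abstract describes.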

This publication has 10 references indexed in Scilit:

Correcting English text using PPM models
Published by Institute of Electrical and Electronics Engineers (IEEE), 2002

Significantly lower entropy estimates for natural DNA sequences
Published by Institute of Electrical and Electronics Engineers (IEEE), 2002

Text mining: a new frontier for lossless compression
Published by Institute of Electrical and Electronics Engineers (IEEE), 1999

Compression and Explanation using Hierarchical Grammars
The Computer Journal, 1997

Bagging predictors
Machine Learning, 1996

The Swiss-3DImage collection and PDB-Browser on the World-Wide Web
Trends in Biochemical Sciences, 1995

K*: An Instance-based Learner Using an Entropic Distance Measure
Published by Elsevier, 1995

Amino acid substitution matrices from protein blocks
Proceedings of the National Academy of Sciences, 1992

Data Compression Using Adaptive Coding and Partial String Matching
IEEE Transactions on Communications, 1984