Protein is incompressible
- 1 January 1999
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- No. 10680314, pp. 257-266
- https://doi.org/10.1109/dcc.1999.755675
Abstract
Life is based on two polymers, DNA and protein, whose properties can be described in a simple text file. It is natural to expect that standard text compression techniques would work on biological sequences as they do on English text. But biological sequences have a fundamentally different structure from linguistic ones, and standard compression schemes exhibit disappointing performance on them. We describe a new approach to compression that takes account of the underlying biochemical principles. This gives rise to a generalization of blending for statistical compressors where every context is used, weighted by its similarity to the current context. Results support what research in bioinformatics has shown, that there is little Markov dependency in protein. This cripples data compression schemes and reduces them to order zero models.
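The blending idea in the abstract can be illustrated with a small sketch. The abstract does not say how context similarity is measured, so the match/mismatch score below is only a stand-in. The function names, the `mismatch` and `prior` parameters, and the uniform prior over 20 amino acids are assumptions made for illustration, not the authors' method.

```python
import math
from collections import defaultdict

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def similarity(ctx_a, ctx_b, mismatch=0.1):
    """Hypothetical similarity: product of per-position scores.

    A matching position scores 1.0 and a differing one scores `mismatch`.
    This is an assumed stand-in for a biochemically informed measure.
    """
    weight = 1.0
    for x, y in zip(ctx_a, ctx_b):
        weight *= 1.0 if x == y else mismatch
    return weight


def blended_code_length(seq, k=2, mismatch=0.1, prior=1.0):
    """Average adaptive code length (bits per symbol) under blending.

    Every previously seen length-k context contributes its next-symbol
    counts, weighted by its similarity to the current context.
    Assumes all symbols of `seq` are in ALPHABET and len(seq) > k.
    """
    counts = defaultdict(lambda: defaultdict(float))  # context -> symbol -> count
    total_bits = 0.0
    for i in range(k, len(seq)):
        ctx, sym = seq[i - k:i], seq[i]
        # Start from a uniform prior so no symbol has zero probability.
        weights = {s: prior for s in ALPHABET}
        for seen_ctx, next_counts in counts.items():
            w = similarity(ctx, seen_ctx, mismatch)
            for s, c in next_counts.items():
                weights[s] += w * c
        total_bits -= math.log2(weights[sym] / sum(weights.values()))
        counts[ctx][sym] += 1.0  # update the model after coding the symbol
    return total_bits / (len(seq) - k)
```

In this sketch, `mismatch=0.0` uses only exact context matches (a plain order-k model), while `mismatch=1.0` weights every context equally and so behaves as an order zero model. If a sequence has little Markov dependency, as the abstract reports for protein, the two settings should give similar code lengths.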