Language Trees and Zipping
- 8 January 2002
- journal article
- research article
- Published by American Physical Society (APS) in Physical Review Letters
- Vol. 88 (4) , 048702
- https://doi.org/10.1103/physrevlett.88.048702
Abstract
In this Letter we present a very general method for extracting information from a generic string of characters, e.g., a text, a DNA sequence, or a time series. Based on data-compression techniques, its key point is the computation of a suitable measure of the remoteness of two bodies of knowledge. We present the implementation of the method to linguistic motivated problems, featuring highly accurate results for language recognition, authorship attribution, and language classification.
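The abstract does not spell out the remoteness measure, so the following is only a minimal sketch of one way a compressor can be used to compare strings, not the authors' implementation. It assumes zlib (DEFLATE) as the compressor and assumes a simple heuristic: append a short probe to a reference text and measure how many extra compressed bytes it costs. The function names (`compressed_size`, `append_cost`, `closest_reference`) and the toy reference texts are hypothetical.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Length in bytes of `data` after zlib (DEFLATE) compression."""
    return len(zlib.compress(data, 9))


def append_cost(reference: bytes, probe: bytes) -> int:
    """Extra compressed bytes needed when `probe` is appended to `reference`.

    A small value suggests the compressor could reuse patterns it already
    saw in `reference`, i.e. the two strings are statistically close.
    This heuristic is an assumption of the sketch, not taken from the abstract.
    """
    return compressed_size(reference + probe) - compressed_size(reference)


def closest_reference(probe: bytes, references: Mapping[str, bytes]) -> str:
    """Label of the reference text with the smallest append cost for `probe`."""
    return min(references, key=lambda label: append_cost(references[label], probe))


# Toy reference corpora (illustrative only; real use needs far longer texts).
REFERENCES = {
    "english": (
        b"the quick brown fox jumps over the lazy dog and the cat sits on "
        b"the mat while the rain falls on the old house by the river "
    ),
    "italian": (
        b"la volpe veloce salta sopra il cane pigro e il gatto siede sul "
        b"tappeto mentre la pioggia cade sulla vecchia casa vicino al fiume "
    ),
}
```

A call such as `closest_reference(b"the dog sits by the river", REFERENCES)` picks the reference whose patterns the probe reuses most. Note that DEFLATE only looks back over a 32 KiB window, so with zlib only the tail of a long reference influences the cost; a compressor with a larger window may be preferable for bigger corpora.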
This publication has 7 references indexed in Scilit:
- Universal schemes for sequential decision from individual data sequences. IEEE Transactions on Information Theory, 1993
- Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 1978
- On the Length of Programs for Computing Finite Binary Sequences. Journal of the ACM, 1966
- A formal theory of inductive inference. Part II. Information and Control, 1964
- A formal theory of inductive inference. Part I. Information and Control, 1964
- A Mathematical Theory of Communication. Bell System Technical Journal, 1948
- A Mathematical Theory of Communication. Bell System Technical Journal, 1948