Language Trees and Zipping

Preprint

31 August 2001

Published on arXiv

Abstract

In this letter we present a very general method for recognizing and classifying information encoded as sequences of characters. The method is based on data-compression techniques; its key point is the computation of the relative entropy between pairs of sequences, which is interpreted as a distance between them. We apply the method to linguistically motivated problems and report highly accurate results for language recognition, author recognition, and language classification.
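
The idea of using a compressor to compare sequences can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name a compressor or an exact estimator, so the use of Python's zlib, the helper names (compressed_len, appended_cost, closest_reference), and the toy reference texts are all assumptions made for illustration. The sketch measures how many extra compressed bytes a short fragment costs when appended to a reference text, and takes a lower cost as a crude sign that the fragment resembles that reference.

```python
import zlib


def compressed_len(data: bytes) -> int:
    """Length in bytes of `data` after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def appended_cost(reference: bytes, fragment: bytes) -> int:
    """Extra compressed bytes needed when `fragment` is appended to `reference`.

    Assumption: this difference is used here as a rough proxy for how well
    the statistics of `reference` describe `fragment`.
    """
    return compressed_len(reference + fragment) - compressed_len(reference)


def closest_reference(fragment: bytes, references: dict) -> str:
    """Return the label of the reference text with the lowest appended cost."""
    return min(references, key=lambda label: appended_cost(references[label], fragment))


# Hypothetical toy corpora; real use would need much longer reference texts.
references = {
    "english": b"the quick brown fox jumps over the lazy dog and the cat sleeps near the warm fire " * 20,
    "italian": b"la volpe veloce salta sopra il cane pigro e il gatto dorme vicino al fuoco caldo " * 20,
}
fragment = b"the lazy dog sleeps near the warm fire and the quick brown fox jumps over the cat"

for label, text in references.items():
    print(label, appended_cost(text, fragment))
print("closest:", closest_reference(fragment, references))
```

Comparing the two appended costs for the same fragment, normalized by the fragment length, gives a quantity in the spirit of a relative-entropy estimate. How closely this matches the estimator used in the letter is not something the abstract alone settles.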

Keywords

LANGUAGE RECOGNITION
DATA COMPRESSION
CLASSIFICATION
LANGUAGE TREES
LINGUISTIC
MOTIVATED
ZIPPING
TREES

All Related Versions

Version 1, 2001-08-31, ArXiv
Version 2, 2001-12-19, ArXiv
Published version: Physical Review Letters, 88 (4), 048702.
