Language Trees and Zipping

  • 31 August 2001
Abstract
In this letter we present a very general method for recognizing and classifying information encoded as sequences of characters. The method is based on data-compression techniques; its key point is the computation of the relative entropy between pairs of sequences, which is interpreted as a distance between them. We apply the method to linguistically motivated problems and report highly accurate results for language recognition, author recognition, and language classification.
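
The abstract does not spell out the procedure, so the following is only a minimal sketch of how a general-purpose compressor might be used to estimate a relative-entropy-like distance between two sequences. It assumes Python's standard zlib module as a stand-in compressor; the function names and the probe_len parameter are hypothetical and are not taken from the letter. The idea illustrated is that a short probe cut from sequence B costs more extra compressed bytes when appended to an unrelated sequence A than when appended to the rest of B.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of `data` after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def extra_cost(reference: bytes, probe: bytes) -> int:
    """Extra compressed bytes needed to encode `probe` when it follows `reference`."""
    return compressed_size(reference + probe) - compressed_size(reference)


def relative_entropy_estimate(a: bytes, b: bytes, probe_len: int = 200) -> float:
    """Rough per-byte estimate of how badly a model of `a` encodes text like `b`.

    Assumption for this sketch: the last `probe_len` bytes of `b` serve as the
    probe, and the remainder of `b` is the reference for `b`'s own statistics.
    """
    if len(b) <= probe_len:
        raise ValueError("b must be longer than probe_len")
    probe = b[-probe_len:]
    b_reference = b[:-probe_len]
    # Cost of the probe after `a`, minus its cost after its own source text.
    return (extra_cost(a, probe) - extra_cost(b_reference, probe)) / len(probe)
```

Under these assumptions, a text of unknown language could be assigned to whichever reference corpus gives the smallest estimate. Note that zlib only looks back over a 32 KiB window, so this sketch is meaningful only for references of modest size.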
