Clustering words with the MDL principle
- 1 January 1996
- proceedings article
- Published by Association for Computational Linguistics (ACL)
- Vol. 1, 4-9
- https://doi.org/10.3115/992628.992633
Abstract
We address the problem of automatically constructing a thesaurus by clustering words based on corpus data. We view this problem as that of estimating a joint distribution over the Cartesian product of a partition of a set of nouns and a partition of a set of verbs, and propose a learning algorithm based on the Minimum Description Length (MDL) Principle for such estimation. We empirically compared the performance of our method based on the MDL Principle against the Maximum Likelihood Estimator in word clustering, and found that the former outperforms the latter. We also evaluated the method by conducting prepositional phrase (PP) attachment disambiguation experiments using an automatically constructed thesaurus. Our experimental results indicate that such a thesaurus can be used to improve accuracy in disambiguation.
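To make the idea of scoring a pair of partitions concrete, the following is a minimal sketch of a two-part description length for a hard co-clustering of (noun, verb) pairs. It is not the paper's algorithm. It assumes, for illustration only, that probability is uniform within each (noun cluster, verb cluster) cell and that the model cost is the common (k/2) log2 N penalty, where k is the number of free cell probabilities and N the number of observed pairs. The function name `description_length` and the toy data are hypothetical.

```python
import math
from collections import Counter


def description_length(pairs, noun_clusters, verb_clusters):
    """Two-part code length (bits) of (noun, verb) pairs under a hard co-clustering.

    Assumptions (illustrative, not taken from the paper):
    - each word belongs to exactly one cluster;
    - probability mass is spread uniformly inside each cluster cell;
    - model cost is (k / 2) * log2(N) with k free cell probabilities.
    """
    noun_of = {n: i for i, cluster in enumerate(noun_clusters) for n in cluster}
    verb_of = {v: j for j, cluster in enumerate(verb_clusters) for v in cluster}
    total = len(pairs)

    # Count observations falling into each (noun cluster, verb cluster) cell.
    cell_counts = Counter((noun_of[n], verb_of[v]) for n, v in pairs)

    # Data cost: negative log-likelihood under the maximum-likelihood cell estimates.
    data_bits = 0.0
    for (i, j), count in cell_counts.items():
        cell_size = len(noun_clusters[i]) * len(verb_clusters[j])
        p_pair = (count / total) / cell_size
        data_bits -= count * math.log2(p_pair)

    # Model cost: one free parameter per cell, minus one for normalization.
    free_params = len(noun_clusters) * len(verb_clusters) - 1
    model_bits = 0.5 * free_params * math.log2(total)
    return model_bits + data_bits


# Toy data (hypothetical): four observed noun-verb co-occurrences.
pairs = [("wine", "drink"), ("beer", "drink"), ("bread", "eat"), ("rice", "eat")]
verbs = [{"drink"}, {"eat"}]

fine = description_length(pairs, [{"wine"}, {"beer"}, {"bread"}, {"rice"}], verbs)
merged = description_length(pairs, [{"wine", "beer"}, {"bread", "rice"}], verbs)
```

On this toy data both partitions fit the observations equally well, but the merged partition needs fewer parameters and therefore gets the shorter total description length. That is the trade-off an MDL criterion uses to prefer coarser clusters when the data do not justify finer ones, whereas a pure maximum-likelihood criterion has no such penalty.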