Automatic word classification using simulated annealing
- 1 January 1993
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- Vol. 2, pp. 41-44
- https://doi.org/10.1109/icassp.1993.319224
Abstract
A bigram class model that gives the probability of a word class given its predecessor class has been developed. Simulated annealing is used to automatically classify the words of large text corpora, and a first validation of its use in language modeling is presented. Results are reported for a French corpus of 40,000 words and a German corpus of 100,000 words. They demonstrate that simulated annealing makes it possible to classify words without any syntactic or semantic knowledge. The best results are obtained when all words are gathered into a single class at the beginning of the optimization. Simulated annealing is easy to implement, and its CPU cost is not prohibitive: seven hours on a 33 MHz 486 PC to classify 14,000 words into 120 classes using a 75,000-word training set, without any code optimization.
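The approach described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the objective (training-set log-likelihood of a bigram class model with maximum-likelihood estimates), the move set (reassigning one random word to a random class), the Metropolis acceptance rule, and the geometric cooling schedule are all assumptions, as are the names `log_likelihood` and `anneal` and every parameter value. The only detail taken from the abstract is that all words start in a single class.

```python
import math
import random
from collections import Counter


def log_likelihood(tokens, assign):
    """Training log-likelihood of a bigram class model (assumed objective).

    P(w_i | w_{i-1}) = P(c_i | c_{i-1}) * P(w_i | c_i), using
    maximum-likelihood estimates taken from the same token sequence.
    """
    classes = [assign[w] for w in tokens]
    word_count = Counter(tokens[1:])       # words in successor position
    class_count = Counter(classes[1:])     # classes in successor position
    pred_count = Counter(classes[:-1])     # classes in predecessor position
    pair_count = Counter(zip(classes[:-1], classes[1:]))
    ll = 0.0
    for (c1, _c2), n in pair_count.items():
        ll += n * math.log(n / pred_count[c1])          # class transition term
    for w, n in word_count.items():
        ll += n * math.log(n / class_count[assign[w]])  # word-given-class term
    return ll


def anneal(tokens, n_classes, steps=5000, t_start=1.0, t_end=0.01, seed=0):
    """Simulated annealing over word-to-class assignments (hypothetical sketch)."""
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    assign = {w: 0 for w in vocab}  # all words start in one class
    current = log_likelihood(tokens, assign)
    best, best_assign = current, dict(assign)
    for step in range(steps):
        # Geometric cooling from t_start down to t_end (assumed schedule).
        temp = t_start * (t_end / t_start) ** (step / steps)
        word = rng.choice(vocab)
        old, new = assign[word], rng.randrange(n_classes)
        if new == old:
            continue
        assign[word] = new
        candidate = log_likelihood(tokens, assign)
        delta = candidate - current
        # Metropolis rule: always accept improvements, sometimes accept losses.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current = candidate
            if current > best:
                best, best_assign = current, dict(assign)
        else:
            assign[word] = old  # reject the move
    return best_assign, best


tokens = "the cat sees the dog a dog sees a cat the dog sees a cat".split()
classes, score = anneal(tokens, n_classes=3)
print(classes, round(score, 3))
```

This sketch recomputes the full likelihood after every move, which is only practical for toy corpora; a corpus of the size reported in the abstract would call for incremental updates of the affected counts.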