Multi-class composite N-gram based on connection direction

Abstract
A new word-clustering technique is proposed to efficiently build statistically salient class 2-grams from language corpora. By splitting each word's neighboring characteristics into preceding and following directions, multiple (two-dimensional) word classes are assigned to each word. On each side, word classes are merged into larger clusters independently, according to the distributions of preceding or following words. This word clustering provides more efficient and statistically reliable word clusters. Further, we extend it to a multi-class composite N-gram whose units are multi-class 2-grams and joined words. The multi-class composite N-gram showed better performance in both perplexity and recognition rate, at a size one thousandth smaller than conventional word 2-grams.
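As a rough illustration of the two-dimensional class idea, the sketch below estimates a 2-gram in which each word carries two independent labels: a "to" class used when the word is being predicted (grouped by what precedes it) and a "from" class used when the word is the context (grouped by what follows it). The factorization P(w | prev) ≈ P(Ct(w) | Cf(prev)) · P(w | Ct(w)) is the usual class 2-gram form adapted to two class maps; it is an assumption here, since the abstract does not state the formula. The function name, the class labels, and the toy corpus are all hypothetical, and the class maps are supplied by hand rather than learned by clustering.

```python
from collections import Counter


def train_multiclass_bigram(sentences, to_class, from_class):
    """Return prob(word, prev), a maximum-likelihood estimate of
    P(Ct(word) | Cf(prev)) * P(word | Ct(word)).

    to_class and from_class are hand-written dicts (hypothetical here);
    the paper obtains such classes by clustering, which is not shown.
    """
    pair_counts = Counter()  # (from-class of prev, to-class of word)
    from_counts = Counter()  # from-class of prev
    word_counts = Counter()  # word in predicted position
    to_counts = Counter()    # to-class of word in predicted position

    for sent in sentences:
        for prev, word in zip(sent, sent[1:]):
            cf, ct = from_class[prev], to_class[word]
            pair_counts[(cf, ct)] += 1
            from_counts[cf] += 1
            word_counts[word] += 1
            to_counts[ct] += 1

    def prob(word, prev):
        cf, ct = from_class[prev], to_class[word]
        if from_counts[cf] == 0 or to_counts[ct] == 0:
            return 0.0  # no smoothing in this sketch
        class_transition = pair_counts[(cf, ct)] / from_counts[cf]
        word_emission = word_counts[word] / to_counts[ct]
        return class_transition * word_emission

    return prob


# Toy corpus and hand-assigned classes (illustrative only).
sentences = [
    ["a", "dog", "runs"],
    ["the", "dog", "runs"],
    ["a", "cat", "sleeps"],
]
# "To" classes: words grouped by the words that precede them.
to_class = {"a": "DET_t", "the": "DET_t", "dog": "N_t", "cat": "N_t",
            "runs": "V_t", "sleeps": "V_t"}
# "From" classes: words grouped by the words that follow them.
from_class = {"a": "DET_f", "the": "DET_f", "dog": "N_f", "cat": "N_f",
              "runs": "V_f", "sleeps": "V_f"}

prob = train_multiclass_bigram(sentences, to_class, from_class)
p_unseen = prob("cat", "the")  # "the cat" never occurs in the toy corpus
```

Because "a" and "the" share a from-class and "dog" and "cat" share a to-class, the unseen pair "the cat" still receives probability mass, while the model stores class-pair counts instead of one count per word pair. That sharing is the kind of size reduction and statistical reliability the abstract refers to, though this sketch omits the clustering step and the joined-word units of the composite model.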
