A variable-length category-based n-gram language model
- 24 December 2002
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- Vol. 1, pp. 164-167
- https://doi.org/10.1109/icassp.1996.540316
Abstract
A language model based on word-category n-grams and ambiguous category membership, with n increased selectively to trade compactness for performance, is presented. The use of categories leads intrinsically to a compact model with the ability to generalise to unseen word sequences, and diminishes the sparseness of the training data, thereby making larger n feasible. The language model implicitly involves a statistical tagging operation, which may be used explicitly to assign categories to untagged text. Experiments on the LOB corpus show the optimal model-building strategy to yield improved results with respect to conventional n-gram methods, and when used as a tagger, the model is seen to perform well in relation to a standard benchmark.
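The core idea of a category-based model with ambiguous membership can be illustrated with a toy sketch: the probability of a word given its history is obtained by summing, over each category the word may belong to, the word's emission probability in that category times the category n-gram probability. All data structures, category labels, and probability values below are hypothetical toy assumptions for illustration, not the paper's trained model (which is built from the LOB corpus with variable-length category histories).

```python
# Toy sketch of a category-based bigram with ambiguous category
# membership. All categories and probabilities below are invented
# illustrative values, not taken from the paper.

# P(word | category) for each category a word may belong to;
# "run" is ambiguous between NOUN and VERB.
word_categories = {
    "run": {"NOUN": 0.3, "VERB": 0.7},
    "the": {"DET": 1.0},
    "dog": {"NOUN": 1.0},
}

# P(category_i | category_{i-1}): a category bigram (the paper
# extends such histories selectively to longer n).
cat_bigram = {
    ("DET", "NOUN"): 0.8,
    ("DET", "VERB"): 0.2,
    ("NOUN", "NOUN"): 0.4,
    ("NOUN", "VERB"): 0.6,
}

def word_prob(word, prev_cat):
    """P(word | prev_cat) = sum over the word's possible categories c
    of P(word | c) * P(c | prev_cat)."""
    return sum(
        p_w_given_c * cat_bigram.get((prev_cat, c), 0.0)
        for c, p_w_given_c in word_categories[word].items()
    )

print(word_prob("dog", "DET"))   # 1.0 * 0.8 = 0.8
print(word_prob("run", "NOUN"))  # 0.3 * 0.4 + 0.7 * 0.6 = 0.54
```

Because the word probability marginalises over category sequences, the most likely category sequence for a sentence falls out of the same computation, which is the implicit tagging operation the abstract mentions.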