Using out-of-domain data to improve in-domain language models
- 1 August 1997
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Signal Processing Letters
- Vol. 4 (8), 221-223
- https://doi.org/10.1109/97.611282
Abstract
Standard statistical language modeling techniques suffer from sparse data problems when applied to real tasks in speech recognition, where large amounts of domain-dependent text are not available. We investigate new approaches to improve sparse application-specific language models by combining domain-dependent and out-of-domain data, including a back-off scheme that effectively leads to context-dependent multiple interpolation weights, and a likelihood-based similarity weighting scheme to use data discriminatively in training a task-specific language model. Experiments with both approaches on a spontaneous speech recognition task (Switchboard) lead to a reduced word error rate over a domain-specific n-gram language model, giving a larger gain than that obtained with previous brute-force data combination approaches.
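The abstract names the two ingredients only at a high level. As an illustration, here is a minimal Python sketch of both ideas: interpolating in-domain and out-of-domain bigram estimates with a context-dependent weight, and likelihood-based similarity weighting that scores out-of-domain sentences under the in-domain model. The toy corpora, the weight function `lam`, the probability floor, and the selection threshold are all assumptions made for this example, not the paper's actual formulation.

```python
from collections import Counter
import math

# Toy stand-ins: the paper's in-domain data is Switchboard transcripts and
# its out-of-domain source is a larger unrelated corpus; these sentences
# are illustrative only.
in_domain = [
    ["so", "how", "are", "you"],
    ["yeah", "how", "are", "you", "doing"],
]
out_of_domain = [
    ["how", "are", "you", "today"],
    ["the", "market", "closed", "higher", "today"],
]

def bigram_counts(corpus):
    """Unigram and bigram counts with a sentence-start token."""
    uni, bi = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    return uni, bi

uni_in, bi_in = bigram_counts(in_domain)
uni_out, bi_out = bigram_counts(out_of_domain)

def ml_prob(prev, w, uni, bi):
    """Maximum-likelihood bigram probability; 0 for unseen contexts."""
    return bi[(prev, w)] / uni[prev] if uni[prev] else 0.0

def lam(prev, k=1.0):
    """Assumed context-dependent interpolation weight: contexts seen
    more often in the in-domain data lean more on its estimate."""
    c = uni_in[prev]
    return c / (c + k)

def interp_prob(prev, w):
    """Context-dependent interpolation of the two bigram models."""
    return (lam(prev) * ml_prob(prev, w, uni_in, bi_in)
            + (1.0 - lam(prev)) * ml_prob(prev, w, uni_out, bi_out))

# Likelihood-based similarity weighting (sketch): score each out-of-domain
# sentence by its per-word log-likelihood under the in-domain model, then
# keep only sentences above an assumed threshold for training the task LM.
def in_domain_loglik(sent, floor=1e-6):
    toks = ["<s>"] + sent
    return sum(math.log(max(ml_prob(p, w, uni_in, bi_in), floor))
               for p, w in zip(toks, toks[1:])) / len(sent)

selected = [s for s in out_of_domain if in_domain_loglik(s) > math.log(1e-4)]
print(interp_prob("are", "you"), [" ".join(s) for s in selected])
```

Under these assumptions the conversational out-of-domain sentence scores well under the in-domain model and is kept, while the newswire-style sentence is filtered out; the actual paper's weighting and back-off details may differ.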