Unsupervised query segmentation using generative language models and Wikipedia
- 21 April 2008
- proceedings article
- Published by Association for Computing Machinery (ACM)
- p. 347-356
- https://doi.org/10.1145/1367497.1367545
Abstract
In this paper, we propose a novel unsupervised approach to query segmentation, an important task in Web search. We use a generative query model to recover a query's underlying concepts that compose its original segmented form. The model's parameters are estimated using an expectation-maximization (EM) algorithm, optimizing the minimum description length objective function on a partial corpus that is specific to the query. To augment this unsupervised learning, we incorporate evidence from Wikipedia. Experiments show that our approach dramatically improves performance over the traditional approach that is based on mutual information, and produces comparable results with a supervised method. In particular, the basic generative language model contributes a 7.4% improvement over the mutual information based method (measured by segment F1 on the Intersection test set). EM optimization further improves the performance by 14.3%. Additional knowledge from Wikipedia provides another improvement of 24.3%, adding up to a total of 46% improvement (from 0.530 to 0.774).
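To make the idea of recovering a query's underlying concepts more concrete, the following is a minimal sketch, not the paper's implementation. It assumes a unigram model over concepts in which a segmentation's probability is the product of its concepts' probabilities, and it finds the most probable segmentation by dynamic programming. The probability table `CONCEPT_PROB`, the floor `UNSEEN_PROB`, the `max_len` limit, and the function name `segment` are all hypothetical and chosen for illustration; the EM estimation, the minimum description length objective, and the Wikipedia evidence described in the abstract are left out.

```python
import math

# Hypothetical concept probabilities, invented for illustration only.
# In the abstract's approach such parameters would be estimated with EM.
CONCEPT_PROB = {
    "new": 0.02,
    "york": 0.01,
    "times": 0.015,
    "subscription": 0.005,
    "new york": 0.008,
    "new york times": 0.006,
    "york times": 0.0001,
}
UNSEEN_PROB = 1e-9  # assumed floor for concepts missing from the table


def segment(query, max_len=3):
    """Return the most probable segmentation under a unigram concept model."""
    words = query.split()
    n = len(words)
    # best[i] holds (log-probability, segments) for the best split of words[:i].
    best = [(0.0, [])] + [(-math.inf, [])] * n
    for end in range(1, n + 1):
        # Try every concept of up to max_len words that ends at position `end`.
        for start in range(max(0, end - max_len), end):
            concept = " ".join(words[start:end])
            prob = CONCEPT_PROB.get(concept, UNSEEN_PROB)
            score = best[start][0] + math.log(prob)
            if score > best[end][0]:
                best[end] = (score, best[start][1] + [concept])
    return best[n][1]


print(segment("new york times subscription"))
# → ['new york times', 'subscription']
```

With these made-up numbers, treating "new york times" as one concept scores higher than splitting it into "new york" and "times", because the product of fewer, more specific concept probabilities is larger.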