Unsupervised query segmentation using generative language models and Wikipedia

21 April 2008

proceedings article
Published by Association for Computing Machinery (ACM)

p. 347-356
https://doi.org/10.1145/1367497.1367545

Abstract

In this paper, we propose a novel unsupervised approach to query segmentation, an important task in Web search. We use a generative query model to recover the underlying concepts that compose a query's original segmented form. The model's parameters are estimated using an expectation-maximization (EM) algorithm, optimizing the minimum description length objective function on a partial corpus that is specific to the query. To augment this unsupervised learning, we incorporate evidence from Wikipedia. Experiments show that our approach dramatically improves performance over the traditional approach based on mutual information, and produces results comparable to those of a supervised method. In particular, the basic generative language model contributes a 7.4% improvement over the mutual-information-based method (measured by segment F1 on the Intersection test set). EM optimization further improves the performance by 14.3%. Additional knowledge from Wikipedia provides another improvement of 24.3%, adding up to a total improvement of 46% (from 0.530 to 0.774).
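The improvement figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is not from the paper; it uses only the numbers stated above. The variable names are illustrative, and the reading that each step is measured against the mutual information baseline (rather than against the previous stage) is an assumption. That reading appears to be the one consistent with the stated total.

```python
# Segment F1 values quoted in the abstract (Intersection test set).
baseline_f1 = 0.530  # mutual information baseline
final_f1 = 0.774     # generative model + EM + Wikipedia

# Step-wise improvements quoted in the abstract, in percent.
steps = {"generative language model": 7.4, "EM": 14.3, "Wikipedia": 24.3}

# Overall relative improvement implied by the two F1 values.
total_relative = (final_f1 - baseline_f1) / baseline_f1 * 100

# Assumed reading: every step is a percentage of the baseline,
# so the steps simply add up.
sum_of_steps = sum(steps.values())
additive_f1 = baseline_f1 * (1 + sum_of_steps / 100)

# Alternative reading: every step is relative to the previous stage,
# so the steps compound.
compounded_f1 = baseline_f1
for pct in steps.values():
    compounded_f1 *= 1 + pct / 100

print(f"total relative improvement: {total_relative:.1f}%")
print(f"sum of steps:               {sum_of_steps:.1f}%")
print(f"F1 if steps add:            {additive_f1:.3f}")
print(f"F1 if steps compound:       {compounded_f1:.3f}")
```

Under the additive reading the reconstructed F1 lands on the reported 0.774, while compounding the three steps would overshoot it, which suggests the percentages are all expressed relative to the baseline score.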
