Dragon Toolkit: Incorporating Auto-Learned Semantic Knowledge into Large-Scale Text Retrieval and Mining
- 1 October 2007
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- Vol. 2 (10823409), 197-201
- https://doi.org/10.1109/ictai.2007.117
Abstract
The majority of text retrieval and mining techniques are still based on exact feature (e.g. word) matching and are unable to incorporate text semantics. Many researchers believe that extension with semantic knowledge could improve the results, and various methods (most of them heuristic) have been proposed to account for concept hierarchy, synonymy, and other semantic relationships. However, the results with such semantic extension have been mixed, ranging from slight improvements to decreases in effectiveness, most likely due to the lack of a formal framework. Instead, we propose a novel method to address the semantic extension within the framework of language modeling. Our method extracts explicit topic signatures from documents and then statistically maps them into single-word features. The incorporation of semantic knowledge then reduces to the smoothing of unigram language models using semantic knowledge. The Dragon Toolkit reflects our method, and its effectiveness is demonstrated by three tasks: text retrieval, text classification, and text clustering.
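The smoothing idea in the abstract can be illustrated with a small sketch. The abstract does not give a formula, so the linear mixture below, the weight `lam`, the function names, and the toy mapping table are all assumptions made for illustration; none of this is the Dragon Toolkit API. The sketch interpolates a document's maximum-likelihood unigram model with word probabilities obtained by mapping its topic signatures to single words.

```python
from collections import Counter


def mle(counts):
    """Maximum-likelihood estimate from raw counts."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


def smoothed_unigram(words, signatures, translation, lam=0.5):
    """Mix a document's unigram model with a signature-mapped model.

    translation[t][w] is an assumed, pre-estimated probability of word w
    given topic signature t (hypothetical values; each row sums to 1).
    lam is an assumed mixture weight, not a value from the paper.
    """
    p_word = mle(Counter(words))
    p_sig = mle(Counter(signatures))
    if not p_sig:
        # No signatures found: fall back to the plain unigram model.
        lam = 0.0
    vocab = set(p_word)
    for t in p_sig:
        vocab.update(translation.get(t, {}))
    model = {}
    for w in vocab:
        # Probability of w reached through the document's topic signatures.
        p_mapped = sum(
            translation.get(t, {}).get(w, 0.0) * p for t, p in p_sig.items()
        )
        model[w] = (1 - lam) * p_word.get(w, 0.0) + lam * p_mapped
    return model


# Toy data, invented for illustration only.
doc_words = ["star", "telescope", "star"]
doc_signatures = ["space program"]
mapping = {"space program": {"space": 0.5, "nasa": 0.3, "star": 0.2}}

model = smoothed_unigram(doc_words, doc_signatures, mapping, lam=0.4)
for word in sorted(model):
    print(word, round(model[word], 3))
```

In this toy case the words "space" and "nasa" never occur in the document, yet they receive non-zero probability through the signature "space program", which is the effect that exact word matching cannot provide.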
This publication has 14 references indexed in Scilit:
- LDA-based document models for ad-hoc retrieval. Published by Association for Computing Machinery (ACM), 2006
- Word sense disambiguation in queries. Published by Association for Computing Machinery (ACM), 2005
- Text categorization by boosting automatically extracted concepts. Published by Association for Computing Machinery (ACM), 2003
- Two-stage language models for information retrieval. Published by Association for Computing Machinery (ACM), 2002
- A study of smoothing methods for language models applied to Ad Hoc information retrieval. Published by Association for Computing Machinery (ACM), 2001
- Concept Decompositions for Large Sparse Text Data Using Clustering. Machine Learning, 2001
- Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning, 2000
- A re-examination of text categorization methods. Published by Association for Computing Machinery (ACM), 1999
- Information retrieval as statistical translation. Published by Association for Computing Machinery (ACM), 1999
- Word Sense Disambiguation and Information Retrieval. Published by Springer Nature, 1994