COMPARING KEYWORD EXTRACTION TECHNIQUES FOR WEBSOM TEXT ARCHIVES

Abstract
The WEBSOM methodology for building very large text archives has a very slow method for extracting meaningful unit labels. This is because the method computes the relative frequencies of all the words in all the documents associated with each unit and then compares these to the relative frequencies of all the words of the other units in the map. Since maps may have more than 100,000 units and the archive may contain up to 7 million documents, the existing WEBSOM method is not practical. A fast alternative method, referred to as the liGHtSOM method, is based on the distribution of weights in the weight vectors of the trained map, plus a simple manipulation of the random projection matrix used for input data compression. Comparisons made using a WEBSOM archive of the Reuters text collection reveal that a high percentage of the keywords extracted using this method match the keywords extracted for the same map units using the original WEBSOM method. A detailed time complexity analysis of the two methods is also provided.
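The following is a minimal sketch, not the authors' implementation, of the kind of frequency-contrast labeling the abstract attributes to the original WEBSOM method: each unit's word frequencies are compared against the average frequencies of all other units, which is the step that becomes costly for large maps. The unit names, documents, and helper functions below are hypothetical placeholders.

```python
from collections import Counter

def relative_frequencies(docs):
    """Relative word frequencies over all documents associated with one unit."""
    counts = Counter(word for doc in docs for word in doc.split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def label_units(unit_docs, top_k=3):
    """Pick top_k label words per unit by contrasting its word frequencies
    with the mean frequencies of the other units (the expensive comparison
    the abstract refers to)."""
    freqs = {unit: relative_frequencies(docs) for unit, docs in unit_docs.items()}
    labels = {}
    for unit, f in freqs.items():
        others = [g for u, g in freqs.items() if u != unit]
        scores = {
            w: p - sum(g.get(w, 0.0) for g in others) / max(len(others), 1)
            for w, p in f.items()
        }
        labels[unit] = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return labels

if __name__ == "__main__":
    # Hypothetical toy archive: two units with a few short documents each.
    toy = {
        "unit_1": ["oil prices rise", "crude oil exports fall"],
        "unit_2": ["bank interest rates rise", "central bank policy"],
    }
    print(label_units(toy))
```

Because every unit is compared against every other unit over full vocabularies, the cost grows with both map size and archive size, which motivates the faster weight-vector-based alternative the abstract describes.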
