COMPARING KEYWORD EXTRACTION TECHNIQUES FOR WEBSOM TEXT ARCHIVES

Abstract
The WEBSOM methodology for building very large text archives has a very slow method for extracting meaningful unit labels. This is because the method computes the relative frequencies of all the words in all the documents associated with each unit and then compares these to the relative frequencies of all the words of the other units in the map. Since maps may have more than 100,000 units and the archive may contain up to 7 million documents, the existing WEBSOM method is not practical. A fast alternative method, referred to as the liGHtSOM method, is based on the distribution of weights in the weight vectors of the trained map, plus a simple manipulation of the random projection matrix used for input data compression. Comparisons made using a WEBSOM archive of the Reuters text collection reveal that a high percentage of the keywords extracted using this method match the keywords extracted for the same map units using the original WEBSOM method. A detailed time complexity analysis of the two methods is also provided.
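The following is a minimal sketch, not the authors' implementation, of the kind of frequency-contrast labeling the abstract attributes to the original WEBSOM method: each unit's word frequencies are compared against the average frequencies of all other units, which is the step that becomes costly for large maps. The unit names, documents, and helper functions below are hypothetical placeholders.

```python
from collections import Counter

def relative_frequencies(docs):
    """Relative word frequencies over all documents associated with one unit."""
    counts = Counter(word for doc in docs for word in doc.split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def label_units(unit_docs, top_k=3):
    """Pick top_k label words per unit by contrasting its word frequencies
    with the mean frequencies of the other units (the expensive comparison
    the abstract refers to)."""
    freqs = {unit: relative_frequencies(docs) for unit, docs in unit_docs.items()}
    labels = {}
    for unit, f in freqs.items():
        others = [g for u, g in freqs.items() if u != unit]
        scores = {
            w: p - sum(g.get(w, 0.0) for g in others) / max(len(others), 1)
            for w, p in f.items()
        }
        labels[unit] = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return labels

if __name__ == "__main__":
    # Hypothetical toy archive: two units with a few short documents each.
    toy = {
        "unit_1": ["oil prices rise", "crude oil exports fall"],
        "unit_2": ["bank interest rates rise", "central bank policy"],
    }
    print(label_units(toy))
```

Because every unit is compared against every other unit over full vocabularies, the cost grows with both map size and archive size, which motivates the faster weight-vector-based alternative the abstract describes.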
