Abstract
An important factor in the performance of information retrieval systems is the choice of classification scheme, which should ideally assign an equal number of documents to each retrieval key. The concept of relative entropy, taken from information theory, has been used in the literature as a criterion of equifrequency, but a mathematical basis for the methods used to generate index terms has not been presented. This paper considers the general problem of grouping a collection of objects of known frequencies so as to balance the frequencies of the resulting sets, and presents mathematical criteria for increasing the balance of a grouping by rearranging the sets. This leads to a method for monotonically increasing the relative entropy of the collection of objects by a sequence of multiway splitting or coalescing steps. The theory is applied to the threshold method used by Lynch to generate equifrequent sets. It is shown that, for typical distributions, some steps in the threshold method decrease the balance and that, for some distributions, the threshold method gives very poor results.
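
As an illustration of the balance criterion mentioned above, the following minimal sketch computes the relative entropy of a grouping. It assumes the common definition of relative entropy as the Shannon entropy of the set frequencies divided by its maximum possible value, log n, for n sets; the abstract does not state the formula, so this definition, the function name `relative_entropy`, and the sample frequencies are assumptions made for illustration only, not the paper's method.

```python
import math


def relative_entropy(set_frequencies):
    """Shannon entropy of the set frequencies divided by log(n).

    Assumed definition (not taken from the abstract): a value of 1.0
    means the n sets are perfectly equifrequent; lower values mean
    the grouping is less balanced.
    """
    n = len(set_frequencies)
    total = sum(set_frequencies)
    if n < 2 or total <= 0:
        raise ValueError("need at least two sets and a positive total frequency")
    # Empty sets contribute nothing to the entropy (0 * log 0 is taken as 0).
    probabilities = [f / total for f in set_frequencies if f > 0]
    entropy = -sum(p * math.log(p) for p in probabilities)
    return entropy / math.log(n)


# Hypothetical object frequencies, grouped into two sets in two different ways.
balanced_grouping = [[8], [4, 2, 1, 1]]   # set totals: 8 and 8
skewed_grouping = [[8, 4], [2, 1, 1]]     # set totals: 12 and 4

balanced = relative_entropy([sum(s) for s in balanced_grouping])
skewed = relative_entropy([sum(s) for s in skewed_grouping])
```

Under this assumed definition, moving the object of frequency 4 from the first set to the second raises the relative entropy, which is the kind of rearrangement step the abstract describes as increasing the balance of a grouping.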
