Concepts and effectiveness of the cover-coefficient-based clustering methodology for text databases

1 December 1990

journal article
Published by Association for Computing Machinery (ACM) in ACM Transactions on Database Systems

Vol. 15 (4), 483-517
https://doi.org/10.1145/99935.99938

Abstract

A new algorithm for document clustering is introduced. The base concept of the algorithm, the cover coefficient (CC) concept, provides a means of estimating the number of clusters within a document database and relates indexing and clustering analytically. The CC concept is also used to identify the cluster seeds and to form clusters with these seeds. It is shown that the complexity of the clustering process is very low. The retrieval experiments show that the information-retrieval effectiveness of the algorithm is compatible with a very demanding complete linkage clustering method that is known to have good retrieval performance. The experiments also show that the algorithm is 15.1 to 63.5 percent (47.5 percent on average) better than four other clustering algorithms in cluster-based information retrieval. The experiments have validated the indexing-clustering relationships and the complexity of the algorithm and have shown improvements in retrieval effectiveness. Two document databases are used in the experiments: TODS214 and INSPEC. The latter is a common database with 12,684 documents.
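
The abstract does not state the formulas behind the cluster-count estimate. As a rough sketch only, assuming the cover coefficient definition usually associated with this methodology (c_ij = alpha_i * sum_k d_ik * beta_k * d_jk, where alpha_i and beta_k are the reciprocals of the row and column sums of the document-term matrix D), the estimated number of clusters is the sum of the diagonal entries c_ii. The function names and the toy matrix below are illustrative assumptions, not taken from the article.

```python
def cover_coefficient_matrix(D):
    """Return the m x m cover coefficient matrix C for a document-term matrix D.

    D is a list of m rows (documents) over n columns (terms) with
    nonnegative weights. Assumed definition (not given in the abstract):
    c_ij = (1 / rowsum_i) * sum_k d_ik * d_jk / colsum_k.
    """
    m, n = len(D), len(D[0])
    row_sums = [sum(row) for row in D]
    col_sums = [sum(D[i][k] for i in range(m)) for k in range(n)]
    if any(s == 0 for s in row_sums) or any(s == 0 for s in col_sums):
        # The sketch assumes every document has a term and every term is used.
        raise ValueError("empty document row or unused term column")
    C = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            shared = sum(D[i][k] * D[j][k] / col_sums[k] for k in range(n))
            C[i][j] = shared / row_sums[i]
    return C


def estimate_cluster_count(D):
    """Sum the diagonal of C (each c_ii measures how unique document i is)."""
    C = cover_coefficient_matrix(D)
    return sum(C[i][i] for i in range(len(D)))


# Toy matrix: documents 0-1 share terms 0-1, documents 2-3 share terms 2-3.
toy_D = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
toy_estimate = estimate_cluster_count(toy_D)
```

For this toy matrix each diagonal entry is 0.5, so the estimate is 2, matching the two disjoint groups of documents. Under the same assumed definition, a matrix in which no two documents share a term gives an estimate equal to the number of documents.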
