An Analysis of Some Graph Theoretical Cluster Techniques
- 1 October 1970
- journal article
- Published by Association for Computing Machinery (ACM) in Journal of the ACM
- Vol. 17 (4) , 571-588
- https://doi.org/10.1145/321607.321608
Abstract
Several graph theoretic cluster techniques aimed at the automatic generation of thesauri for information retrieval systems are explored. Experimental cluster analysis is performed on a sample corpus of 2267 documents. A term-term similarity matrix is constructed for the 3950 unique terms used to index the documents. Various threshold values, T , are applied to the similarity matrix to provide a series of binary threshold matrices. The corresponding graph of each binary threshold matrix is used to obtain the term clusters. Three definitions of a cluster are analyzed: (1) the connected components of the threshold matrix; (2) the maximal complete subgraphs of the connected components of the threshold matrix; (3) clusters of the maximal complete subgraphs of the threshold matrix, as described by Gotlieb and Kumar. Algorithms are described and analyzed for obtaining each cluster type. The algorithms are designed to be useful for large document and index collections. Two algorithms have been tested that find maximal complete subgraphs. An algorithm developed by Bierstone offers a significant time improvement over one suggested by Bonner. For threshold levels T ≥ 0.6, basically the same clusters are developed regardless of the cluster definition used. In such situations one need only find the connected components of the graph to develop the clusters.Keywords
This publication has 8 references indexed in Scilit:
- Semantic Clustering of Index TermsJournal of the ACM, 1968
- A clustering experiment: First step towards a computer-generated classification schemeInformation Storage and Retrieval, 1968
- Automatic indexing :Published by National Institute of Standards and Technology (NIST) ,1965
- On Some Clustering TechniquesIBM Journal of Research and Development, 1964
- Information Retrieval Based upon Latent Class AnalysisJournal of the ACM, 1962
- Concerning the Possibility of a Cooperative Information Exchange [Letter to the Editor]IBM Journal of Research and Development, 1962
- The Association Factor in Information RetrievalJournal of the ACM, 1961
- A Computer Program for Classifying PlantsScience, 1960